
A Low-Rank Augmented Lagrangian Method
for Large-Scale Semidefinite Programming
Based on a Hybrid Convex-Nonconvex Approach

Renato D.C. Monteiro and Arnesh Sujanani: Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332-0205 (Email: [email protected] & [email protected]). These authors were partially supported by AFOSR Grant FA9550-22-1-0088.    Diego Cifuentes: Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332-0205 (Email: [email protected]). This author was partially supported by the Office of Naval Research, Grant N00014-23-1-2631.
January 22, 2024 (second version: March 15, 2024)
Abstract

This paper introduces HALLaR, a new first-order method for solving large-scale semidefinite programs (SDPs) with bounded domain. HALLaR is an inexact augmented Lagrangian (AL) method where the AL subproblems are solved by a novel hybrid low-rank (HLR) method. The recipe behind HLR is based on two key ingredients: 1) an adaptive inexact proximal point method with inner acceleration; 2) Frank-Wolfe steps to escape from spurious local stationary points. In contrast to the low-rank method of Burer and Monteiro, HALLaR finds a near-optimal solution (with provable complexity bounds) of SDP instances satisfying strong duality. Computational results comparing HALLaR to state-of-the-art solvers on several large SDP instances arising from maximum stable set, phase retrieval, and matrix completion show that the former finds highly accurate solutions in substantially less CPU time than the latter. For example, in less than 20 minutes, HALLaR can solve a maximum stable set SDP instance with dimension pair $(n,m)\approx(10^6,10^7)$ within $10^{-5}$ relative precision.

Keywords: semidefinite programming, augmented Lagrangian, low-rank methods, proximal point method, Frank-Wolfe method, iteration complexity, adaptive method, global convergence rate

1   Introduction

Semidefinite programming (SDP) has many applications in engineering, machine learning, the sciences, and finance, among other areas. However, solving large-scale SDPs is very computationally challenging. In particular, interior point methods usually stall on large-scale instances due to lack of memory. This has motivated a recent surge of first-order methods for solving SDPs that scale to larger instances [65, 60, 45, 25, 18, 66, 55, 62, 64, 50].

This paper introduces HALLaR, a new first-order method for solving SDPs with bounded trace. Let $\mathbb{S}^n$ be the space of symmetric $n\times n$ matrices with Frobenius inner product $\bullet$ and with positive semidefinite partial order $\succeq$. HALLaR solves the primal/dual pair of SDPs:

$\min_X\ \{C\bullet X \ :\ \mathcal{A}X = b,\ X\in\Delta^n\}$ (P)
$\max_{p\in\mathbb{R}^m,\,\theta\in\mathbb{R}}\ \{-b^Tp - \theta \ :\ S := C + \mathcal{A}^*p + \theta I\succeq 0,\ \theta\geq 0\}$ (D)

where $b\in\mathbb{R}^m$, $C\in\mathbb{S}^n$, $\mathcal{A}:\mathbb{S}^n\to\mathbb{R}^m$ is a linear map, $\mathcal{A}^*:\mathbb{R}^m\to\mathbb{S}^n$ is its adjoint, and $\Delta^n$ is the spectraplex

$\Delta^n := \{X\in\mathbb{S}^n : \operatorname{tr}X\leq 1,\ X\succeq 0\}.$ (1)

HALLaR is based on Burer and Monteiro's low-rank (LR) approach [7, 8], which is described in the next paragraph.

Low-rank approach.

The LR approach is motivated by SDPs often having optimal solutions with small ranks. More specifically, it is known (see [47, 2, 54]) that $r_*\leq\sqrt{2m}$, where $r_*$ is the smallest among the ranks of all optimal solutions of (P). The LR approach consists of solving subproblems obtained by restricting (P) to matrices of rank at most $r$, for some integer $r$, or equivalently, the nonconvex smooth reformulation

$\min_U\ \left\{C\bullet UU^T \ :\ \mathcal{A}(UU^T) = b,\ \|U\|_F\leq 1,\ U\in\mathbb{R}^{n\times r}\right\}.$ ($P_r$)

Problems (P) and ($P_r$) are equivalent when $r\geq r_*$, in the sense that if $U_*$ is optimal for ($P_r$) then $X_* = U_*U_*^T$ is optimal for (P). The advantage of ($P_r$) compared to (P) is that its matrix variable $U$ has significantly fewer entries than that of (P) when $r\ll n$, namely, $nr$ instead of $n(n+1)/2$. However, as ($P_r$) is nonconvex, it may have stationary points which are not globally optimal. For a generic instance, the following results are known: i) if $r\geq\sqrt{2m}$ then all local minima of ($P_r$) are globally optimal (see [14, 15, 4, 5, 3, 48]); and ii) if $r<\sqrt{2m}$ then ($P_r$) may have local minima which are not globally optimal (see [58]).

Outline of HALLaR.

HALLaR is an inexact augmented Lagrangian (AL) method that generates sequences $\{X_t\}$ and $\{p_t\}$ according to the recursions

$X_t \approx \operatorname*{arg\,min}_X\ \{\mathcal{L}_\beta(X;p_{t-1}) : X\in\Delta^n\},$ (2a)
$p_t = p_{t-1} + \beta(\mathcal{A}X_t - b)$ (2b)

where

$\mathcal{L}_\beta(X;p) := C\bullet X + p^T(\mathcal{A}X - b) + \frac{\beta}{2}\|\mathcal{A}X - b\|^2.$ (3)
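To make the recursion concrete, the following is a minimal sketch of (2a)-(2b), stated for simplicity in the $X$-space; the names `solve_al_subproblem`, `A`, and `At` are placeholders (in particular, HALLaR itself never forms $X$, working instead with a low-rank factor as described below):

```python
import numpy as np

def al_outline(C, A, At, b, beta, p0, eps_p, solve_al_subproblem):
    """Sketch of recursions (2a)-(2b). A and At are callables implementing
    the linear map A and its adjoint A*; solve_al_subproblem stands in for
    the HLR method of Section 2."""
    p = p0
    while True:
        # (2a): approximately minimize L_beta(.; p) over the spectraplex;
        # note that grad L_beta(X; p) = C + At(p + beta * (A(X) - b)).
        X = solve_al_subproblem(C, A, At, b, beta, p)
        r = A(X) - b                  # primal residual
        p = p + beta * r              # (2b): full multiplier update
        if np.linalg.norm(r) <= eps_p:
            return X, p
```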

The key part of HALLaR is an efficient method, called Hybrid Low-Rank (HLR), for finding an approximate global solution $X_t$ of the AL subproblem (2a). The HLR method solves subproblems of the form

$\min_Y\ \left\{\mathcal{L}_\beta(YY^T;p_{t-1}) \ :\ \|Y\|_F\leq 1,\ Y\in\mathbb{R}^{n\times r}\right\}$ ($\mathrm{L}_r$)

for some integer $r\geq 1$. Subproblem ($\mathrm{L}_r$) is equivalent to the subproblem obtained by restricting $X$ in (2a) to matrices with rank at most $r$. Since ($\mathrm{L}_r$) is nonconvex, it may have a spurious (near) stationary point, i.e., a (near) stationary point $Y$ such that $YY^T$ is not (nearly) optimal for (2a).

More specifically, HLR finds an approximate global solution $X_t$ of (2a) by solving a sequence of nonconvex subproblems $(\mathrm{L}_{r_k})_{k\geq 1}$ such that $r_{k+1}\leq r_k + 1$, according to the following steps: i) find a near stationary point $Y = Y_k\in\mathbb{R}^{n\times r_k}$ of $(\mathrm{L}_{r_k})$ using an adaptive accelerated inexact proximal point (ADAP-AIPP) method that is based on a combination of ideas developed in [56, 13, 46, 37, 36]; ii) check whether $Y_kY_k^T$ is nearly optimal for (2a) through a minimum eigenvalue computation and terminate the method if so; else iii) use the following escaping strategy to move away from the current spurious near stationary point $Y_k$: perform a Frank-Wolfe (FW) step from $Y_k$ to obtain a point $\tilde Y_k$ with either one column, in which (unlikely) case $r_{k+1}$ is set to one, or with $r_k+1$ columns, in which case $r_{k+1}$ is set to $r_k+1$, and use $\tilde Y_k$ as the initial iterate for solving $\mathrm{L}_{r_{k+1}}$. The initial pair $(r_1,\tilde Y_0)$ for HLR is chosen by a warm start strategy, namely, as the pair obtained at the end of the HLR call for solving the previous subproblem (2a). It is worth noting that HALLaR only stores the current iterate $Y$ and never computes the (implicit) iterate $YY^T$ (lying in the $X$-space).

Under the strong duality assumption, it is shown that HALLaR obtains an approximate primal-dual solution of (P) and (D) with provable computational complexity bounds expressed in terms of parameters associated with the SDP instance and user-specified tolerances.

Computational impact.

Our computational results show that HALLaR performs very well on many large-scale SDPs such as phase retrieval, maximum stable set, and matrix completion. In all these applications, HALLaR efficiently obtains accurate solutions for large-scale instances, largely outperforming other state-of-the-art solvers. For example, HALLaR takes approximately 1.75 hours (resp. 13 hours) on a personal laptop to solve, within $10^{-5}$ relative precision, a maximum stable set SDP instance for a Hamming graph with $n\approx 4{,}000{,}000$ and $m\approx 40{,}000{,}000$ (resp. $n\approx 16{,}000{,}000$ and $m\approx 200{,}000{,}000$). Moreover, HALLaR takes approximately 7.5 hours on a personal laptop to solve, within $10^{-5}$ relative precision, a phase retrieval SDP instance with $n = 1{,}000{,}000$ and $m = 12{,}000{,}000$. An important reason for the good computational performance of HALLaR is that the ranks of the iterates $X_t$ remain relatively small throughout the whole algorithm.

Related works.

This part describes other methods for solving large-scale SDPs. SDPNAL+ [65, 60] is an AL based method that solves each AL subproblem using a semismooth Newton-CG method. Algorithms based on spectral bundle methods (and, more generally, bundle methods) have also been proposed and studied for solving large-scale SDPs (e.g., [31, 9, 30, 19, 44]). More recent works (e.g., [45, 25, 18, 41, 66]) propose methods for solving large-scale SDPs based on the alternating direction method of multipliers (ADMM). The remainder of this section discusses in more detail works that rely on the nonconvex LR approach and the FW method, since those works are more closely related to this paper. The reader is referred to the survey paper [42] for additional methods for solving large-scale SDPs.

The nonconvex LR approach of [7, 8] has been successful in solving many relevant classes of SDPs. The SDPLR method developed in these works solves ($P_r$) with an AL method whose AL subproblems are solved by a limited memory BFGS method. Although SDPLR only handles equality constraints, it is possible to modify it to handle inequalities (e.g., [38]). HALLaR is also based on the AL method, but it applies the method directly to (P) instead of ($P_r$). Moreover, in contrast to SDPLR, HALLaR solves the AL subproblems using the HLR method outlined above.

This paragraph describes works that solve (possibly a sequence of) ($P_r$) without using the AL method. Approaches that use interior point methods for solving ($P_r$) have been pursued, for example, in [51]. In the context of MaxCut SDPs, several specialized methods have been proposed which solve ($P_r$) using optimization techniques that preserve feasibility (e.g., [6, 32, 21, 43, 35]). Finally, Riemannian optimization methods have been used to solve special classes of SDPs where the feasible sets of ($P_r$) are smooth manifolds (e.g., [52, 35, 33, 43]).

The FW method minimizes a convex function $g(X)$ over a compact convex domain (e.g., the spectraplex $\Delta^n$). It is appealing when a sparse solution is desired, where the notion of sparsity is broad (e.g., small cardinality and/or rank). The FW method has been used (e.g., [29, 55, 34]) for solving SDP feasibility problems by minimizing $g(X) = \phi(\mathcal{A}X - b)$, where $\phi$ is either the squared norm function $\|\cdot\|^2$ or the function $\mathrm{LSE}(y) = \log(\sum_i \exp y_i)$. Several papers (e.g., [23, 28, 49]) introduce variants of the FW method for general convex optimization problems.

Another interesting method for solving (P) is CGAL of [62, 63], which generates its iterates by performing only FW steps with respect to AL subproblems in the same format as (2a). Like HALLaR, the method of [63] only generates iterates in the $Y$-space. Its Lagrange multiplier update policy, though, differs from (2b) in that it updates the Lagrange multiplier in a more conservative way, i.e., with $\beta$ in (2b) replaced by a usually much smaller $\gamma_t>0$, and does so only when the size of the new tentative multiplier is not too large. Moreover, instead of using a pure FW method to solve (2a), an iteration of the subroutine HLR invoked by HALLaR to solve (2a) consists of an ADAP-AIPP call applied to ($\mathrm{L}_r$) and, if HLR does not terminate, also a FW step (which generally increases the rank of the iterate by one). As demonstrated by our computational results, the use of ADAP-AIPP calls significantly reduces the number of FW steps performed by HALLaR and, as a by-product, keeps the ranks of its iterates considerably smaller than those of the CGAL iterates.

The CGAL method was enhanced in [64] to derive a low-storage variant, namely, Sketchy-CGAL. Instead of explicitly storing its most recent $Y$-iterate as CGAL does, this variant computes a certain approximation of the above iterates lying in $\mathbb{R}^{n\times r}$, where $r\in\{1,\ldots,n-1\}$ is a specified threshold value whose purpose is to limit the rank of the stored approximation. It is shown in [64] that Sketchy-CGAL has $O(m+nr)$ memory storage, and that it outputs an $O(r^*/(r-r^*-1))$-approximate solution of (P) (constructed using the sketch) under the assumption that $r>r^*+1$, where $r^*$ is the largest among the ranks of all optimal solutions of (P). In contrast to either CGAL or HALLaR, a disadvantage of Sketchy-CGAL is that the accuracy of its output primal approximate solution is often low and degrades further as $r$ decreases, and can even be undetermined if $r\leq r^*+1$. Finally, alternative methods for solving SDPs with $O(m+nr^*)$ memory storage are presented in [20, 55, 59].

Structure of the paper.

This paper is organized into four sections. Section 2 discusses the HLR method for solving the AL subproblem (2a) and, more generally, smooth convex optimization problems over the spectraplex $\Delta^n$. It also presents complexity bounds for HLR, given in Theorem 2.5. Section 3 presents HALLaR for solving the pair of SDPs (P) and (D) and presents the main complexity result of this paper, namely Theorem 3.2, which provides complexity bounds for HALLaR. Finally, Section 4 presents computational experiments comparing HALLaR with various solvers on a large collection of SDPs arising from maximum stable set, phase retrieval, and matrix completion problems.

1.1   Basic Definitions and Notations

Let $\mathbb{R}^n$ be the space of $n$-dimensional vectors, $\mathbb{R}^{n\times r}$ the space of $n\times r$ matrices, and $\mathbb{S}^n$ the space of $n\times n$ symmetric matrices. Let $\mathbb{R}^n_{++}$ ($\mathbb{R}^n_+$) be the convex cone in $\mathbb{R}^n$ of vectors with positive (nonnegative) entries, and let $\mathbb{S}^n_{++}$ ($\mathbb{S}^n_+$) be the convex cone in $\mathbb{S}^n$ of positive (semi)definite matrices. Let $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ be the Euclidean inner product and norm on $\mathbb{R}^n$, and let $\bullet$ and $\|\cdot\|_F$ be the Frobenius inner product and norm on $\mathbb{S}^n$. The minimum eigenvalue of a matrix $Q\in\mathbb{S}^n$ is denoted by $\lambda_{\min}(Q)$, and $v_{\min}(Q)$ denotes a corresponding eigenvector of unit norm. For any $t>0$ and $a\geq 0$, let $\log_a^+(t) := \max\{\log t,\, a\}$.

For a given closed convex set $C\subseteq\mathbb{R}^n$, its boundary is denoted by $\partial C$ and the distance of a point $z\in\mathbb{R}^n$ to $C$ is denoted by $\mathrm{dist}(z,C)$. The diameter of $C$, denoted $D_C$, is defined as

$D_C := \sup\{\|Z - Z'\| : Z, Z'\in C\}.$ (4)

The indicator function of $C$, denoted by $\delta_C$, is defined by $\delta_C(z) = 0$ if $z\in C$, and $\delta_C(z) = +\infty$ otherwise. The domain of a function $h:\mathbb{R}^n\to(-\infty,\infty]$ is the set $\mathrm{dom}\,h := \{x\in\mathbb{R}^n : h(x) < +\infty\}$. Moreover, $h$ is said to be proper if $\mathrm{dom}\,h\neq\emptyset$. The $\epsilon$-subdifferential of a proper convex function $h:\mathbb{R}^n\to(-\infty,\infty]$ is defined by

$\partial_\epsilon h(z) := \{u\in\mathbb{R}^n : h(z')\geq h(z) + \langle u, z'-z\rangle - \epsilon\ \ \forall z'\in\mathbb{R}^n\}$ (5)

for every $z\in\mathbb{R}^n$. The classical subdifferential, denoted by $\partial h(\cdot)$, corresponds to $\partial_0 h(\cdot)$. Recall that, for a given $\epsilon\geq 0$, the $\epsilon$-normal cone of a closed convex set $C$ at $z\in C$, denoted by $N_C^\epsilon(z)$, is

$N_C^\epsilon(z) := \{\xi\in\mathbb{R}^n : \langle\xi, u-z\rangle\leq\epsilon\ \ \forall u\in C\}.$

The normal cone of a closed convex set $C$ at $z\in C$ is denoted by $N_C(z) = N_C^0(z)$.

Given a differentiable function $\psi:\mathbb{R}^n\to\mathbb{R}$, its affine approximation at a point $\bar z\in\mathbb{R}^n$ is

$\ell_\psi(z;\bar z) := \psi(\bar z) + \langle\nabla\psi(\bar z), z-\bar z\rangle \quad \forall z\in\mathbb{R}^n.$ (6)

The function $\psi$ is $L$-smooth on a set $\Omega\subseteq\mathbb{R}^n$ if its gradient is $L$-Lipschitz continuous on $\Omega$, i.e.,

$\|\nabla\psi(x') - \nabla\psi(x)\|\leq L\|x' - x\| \quad \forall x, x'\in\Omega.$ (7)

The set of $L$-smooth functions on $\Omega$ is denoted by $\mathcal{C}^1(\Omega;L)$.

2   Hybrid Low-Rank Method

This section introduces a Hybrid Low-Rank (HLR) method which, as outlined in the introduction, uses a combination of the ADAP-AIPP method and Frank-Wolfe steps for approximately solving convex problems of the form (2a). This section consists of three subsections. The first subsection states the main problem that the HLR method considers and the notion of approximate solution that it aims to find. The second subsection presents the ADAP-AIPP method and its complexity results. The third subsection states the complete HLR method and establishes its total complexity.

2.1   Problem of Interest and Solution Type

Let $g:\mathbb{S}^n\to\mathbb{R}$ be a convex and differentiable function. The HLR method is developed in the context of solving the problem

$g_* := \min\{g(Z) : Z\in\Delta^n\}$ (8)

where $\Delta^n$ is the spectraplex as in (1) and $g$ is $L_g$-smooth on $\Delta^n$, i.e., there exists $L_g\geq 0$ such that

$\|\nabla g(Z') - \nabla g(Z)\|_F\leq L_g\|Z' - Z\|_F \quad \forall Z, Z'\in\Delta^n.$ (9)

The goal of the HLR method is to find a near-optimal solution of (8) whose definition is given immediately after the next result.

Lemma 2.1.

Let $Z\in\Delta^n$ be given and define

$\theta(Z) := \max\{-\lambda_{\min}(\nabla g(Z)),\, 0\}.$ (10)

Then:

  • a)

there hold

    $\theta(Z)\geq 0, \quad \nabla g(Z) + \theta(Z)I\succeq 0;$ (11)
  • b)

for any $\epsilon>0$, the inclusion

    $0\in\nabla g(Z) + \partial_\epsilon\delta_{\Delta^n}(Z)$ (12)

    holds if and only if

    $\nabla g(Z)\bullet Z + \theta(Z)\leq\epsilon.$ (13)
Proof.

(a) The result is immediate from the definition of $\theta(Z)$ in (10).

(b) It is easy to see that $\partial_\epsilon\delta_{\Delta^n}(Z) = N^\epsilon_{\Delta^n}(Z)$. Statement (b) then follows immediately from this observation, the definition of $\theta(Z)$ in (10), and Proposition A.2(b) with $G = \nabla g(Z)$. ∎

Relation (13) provides an easily verifiable condition for checking whether $Z$ satisfies inclusion (12). Moreover, since $(1-\operatorname{tr}Z)\,\theta(Z) + \theta(Z)I\bullet Z = \theta(Z)$, relation (13) is equivalent to the “complementary slackness” condition

$(1-\operatorname{tr}Z)\,\theta(Z) + [\nabla g(Z) + \theta(Z)I]\bullet Z\leq\epsilon.$ (14)
Definition 2.2.

An $\epsilon$-optimal solution of (8) is a matrix $Z\in\Delta^n$ satisfying relation (12) or, equivalently, (13).

The next lemma shows that the objective value of an $\epsilon$-optimal solution of (8) is within $\epsilon$ of the optimal value of (8).

Lemma 2.3.

An $\epsilon$-optimal solution $Z$ of (8) satisfies $g(Z) - g_*\leq\epsilon$.

Proof.

Let $Z_*$ be an optimal solution of (8). Relation (12) implies that $-\nabla g(Z)\in N^\epsilon_{\Delta^n}(Z)$ and hence that $\langle -\nabla g(Z), Z_* - Z\rangle\leq\epsilon$. It then follows from this relation and the convexity of $g$ that

$g(Z_*) - g(Z)\geq\langle\nabla g(Z), Z_* - Z\rangle\geq -\epsilon,$

which immediately implies the result. ∎

2.2   The ADAP-AIPP Method

As already mentioned in the introduction, one iteration of the HLR method consists of a call to the ADAP-AIPP method followed by a FW step. The purpose of this subsection is to describe the details of the ADAP-AIPP method.

For a given integer $s$, consider the subproblem obtained by restricting (8) to matrices $Z$ of rank at most $s$, or equivalently, the reformulation

$\min\{\tilde g(U) := g(UU^T) \ :\ U\in\bar B_1^s\},$ (15)

where

$\bar B_r^s := \{U\in\mathbb{R}^{n\times s} : \|U\|_F\leq r\}$ (16)

denotes the Frobenius ball of radius $r$ in $\mathbb{R}^{n\times s}$. In this subsection, the above set will be denoted by $\bar B_r$ since the column dimension $s$ remains constant throughout the presentation.

The goal of the ADAP-AIPP method is to find an approximate stationary solution of (15) as described in Proposition 2.4(a) below. Briefly, ADAP-AIPP is an inexact proximal point method which attempts to solve its (potentially nonconvex) prox subproblems using an accelerated composite gradient method, namely ADAP-FISTA, whose description is given in Appendix B. A rough description of the $j$-th iteration of ADAP-AIPP is as follows: given $W_{j-1}\in\bar B_1$ and a positive scalar $\lambda_{j-1}$, ADAP-AIPP calls the ADAP-FISTA method to attempt to find a suitable approximate solution of the possibly nonconvex proximal subproblem

$\min_{U\in\bar B_1}\left\{\lambda\tilde g(U) + \tfrac{1}{2}\|U - W_{j-1}\|_F^2\right\},$ (17)

where the first call is always performed with $\lambda = \lambda_{j-1}$. If ADAP-FISTA successfully finds such a solution, ADAP-AIPP sets this solution as its next iterate $W_j$ and sets $\lambda$ as its next prox stepsize $\lambda_j$. If ADAP-FISTA is unsuccessful, ADAP-AIPP invokes it again to attempt to solve (17) with $\lambda = \lambda/2$. This loop always terminates since ADAP-FISTA is guaranteed to terminate with success when the objective in (17) becomes strongly convex, which occurs once $\lambda$ is sufficiently small.

The formal description of the ADAP-AIPP method is presented below. For the sake of simplifying the input lists of the algorithms stated throughout this paper, the parameters $\sigma$ and $\chi$ are considered universal ones (and hence not input parameters).

 

ADAP-AIPP Method

 

Universal Parameters: $\sigma\in(0,1/2)$ and $\chi\in(0,1)$.

Input: quadruple $(\tilde g, \lambda_0, \underline{W}, \bar\rho)\in\mathcal{C}^1(\Delta^n;L_g)\times\mathbb{R}_{++}\times\bar B_1\times\mathbb{R}_{++}$.

  • 0.

set $W_0 = \underline{W}$, $j = 1$, and

    $\lambda = \lambda_0, \quad \bar M_0 = 1;$ (18)
  • 1.

choose $\underline{M}_j\in[1,\bar M_{j-1}]$ and call the ADAP-FISTA method in Appendix B with universal input $(\sigma,\chi)$ and inputs

    $x_0 = W_{j-1}, \quad (\mu, L_0) = (1/2, \underline{M}_j),$ (19)
    $\psi_s = \lambda\tilde g + \tfrac{1}{2}\|\cdot - W_{j-1}\|_F^2, \quad \psi_n = \lambda\delta_{\bar B_1};$ (20)
  • 2.

if ADAP-FISTA fails or its output $(W, V, L)$ (if it succeeds) does not satisfy the inequality

    $\lambda\tilde g(W_{j-1}) - \left[\lambda\tilde g(W) + \tfrac{1}{2}\|W - W_{j-1}\|_F^2\right]\geq V\bullet(W_{j-1} - W),$ (21)

    then set $\lambda = \lambda/2$ and go to step 1; else, set $(\lambda_j,\bar M_j) = (\lambda, L)$, $(W_j, V_j) = (W, V)$, and

    $R_j := \frac{V_j + W_{j-1} - W_j}{\lambda_j}$ (22)

    and go to step 3;

  • 3.

if $\|R_j\|_F\leq\bar\rho$, then stop with success and output $(\overline{W},\overline{R}) = (W_j, R_j)$; else, go to step 4;

  • 4.

set $j\leftarrow j+1$ and go to step 1.

 

Several remarks about ADAP-AIPP are now given. First, at each iteration, steps 1 and 2 successively call the ADAP-FISTA method with inputs given by (19) and (20) to obtain a prox stepsize $\lambda_j\leq\lambda_{j-1}$ and a pair $(W_j, V_j)$ satisfying (21) and

$\|V_j\|_F\leq\sigma\|W_j - W_{j-1}\|_F, \quad V_j\in\lambda_j\left[\nabla\tilde g(W_j) + \partial\delta_{\bar B_1}(W_j)\right] + (W_j - W_{j-1})$ (23)

where $\sigma$ is part of the input of ADAP-AIPP. Such a pair $(W_j, V_j)$ can be viewed as an approximate stationary solution of prox subproblem (17) with $\lambda = \lambda_j$, where the residual $V_j$ is relaxed from being zero to a quantity that is relatively bounded as in (23). Second, it follows immediately from the inclusion in (23) and the definition of $R_j$ in (22) that the pair $(W_j, R_j)$ computed in step 2 of ADAP-AIPP satisfies the inclusion $R_j\in\nabla\tilde g(W_j) + \partial\delta_{\bar B_1}(W_j)$ for every iteration $j\geq 1$. As a consequence, if ADAP-AIPP terminates in step 3, then the pair $(\overline{W},\overline{R}) = (W_j, R_j)$ output by this step is a $\bar\rho$-approximate stationary solution of (15), i.e., it satisfies

$\overline{R}\in\nabla\tilde g(\overline{W}) + \partial\delta_{\bar B_1}(\overline{W}), \quad \|\overline{R}\|_F\leq\bar\rho.$ (24)

Finally, it is interesting to note that ADAP-AIPP is a universal method in that it requires no knowledge of any parameters (such as objective function curvatures) underlying problem (15).
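To illustrate the loop structure just described, the following is a minimal sketch of ADAP-AIPP, assuming a routine `adap_fista(W, lam)` (a placeholder for the method of Appendix B) that returns a triple $(W, V, L)$ on success and `None` on failure:

```python
import numpy as np

def adap_aipp(g_tilde, W0, lam0, rho_bar, adap_fista):
    """Sketch of ADAP-AIPP: inexact proximal point steps on problem (15),
    halving the prox stepsize lam until the descent test (21) holds."""
    W_prev, lam = W0, lam0
    while True:
        out = adap_fista(W_prev, lam)       # attempt prox subproblem (17)
        if out is not None:
            W, V, L = out
            descent = lam * g_tilde(W_prev) - lam * g_tilde(W) \
                      - 0.5 * np.linalg.norm(W - W_prev) ** 2
            if descent >= np.sum(V * (W_prev - W)):   # descent test (21)
                R = (V + W_prev - W) / lam            # residual (22)
                if np.linalg.norm(R) <= rho_bar:      # stopping rule, step 3
                    return W, R
                W_prev = W                            # accept iterate, step 4
                continue
        lam /= 2                                      # halve lam and retry
```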

Before stating the main complexity result of the ADAP-AIPP method, the following quantities are introduced:

$\bar G := \sup\{\|\nabla g(UU^T)\|_F : U\in\bar B_3\}, \quad L_{\tilde g} := 2\bar G + 36L_g,$ (25)
$\underline{\lambda} := \min\{\lambda_0,\, 1/(4L_{\tilde g})\}, \quad C_\sigma := \frac{2(1-\sigma)^2}{1-2\sigma},$ (26)

where $\lambda_0$ is the initial prox stepsize of ADAP-AIPP and $L_g$ and $\bar B_3$ are as in (9) and (16), respectively. Observe that $C_\sigma$ is well defined and positive due to the fact that $\sigma\in(0,1/2)$.

The main complexity result of ADAP-AIPP is stated in the proposition below. Its proof is in Appendix C.

Proposition 2.4.

The following statements about ADAP-AIPP hold:

  • (a)

ADAP-AIPP terminates with a pair $(\overline{W},\overline{R})$ that is a $\bar\rho$-approximate stationary solution of (15), and its last iteration index $l$ satisfies

    $1\leq l\leq\mathcal{T} := 1 + \frac{C_\sigma}{\underline{\lambda}\bar\rho^2}\left[\tilde g(\underline{W}) - \tilde g(\overline{W})\right],$ (27)

    where $\bar\rho>0$ is an input tolerance, $\underline{W}$ is the initial point, and $C_\sigma$ and $\underline{\lambda}$ are as in (26);

  • (b)

    the total number of ADAP-FISTA calls performed by ADAP-AIPP is no more than

$\mathcal{T} + \lceil\log_0^+(\lambda_0/\underline{\lambda})/\log 2\rceil$ (28)

    where $\lambda_0$ is the initial prox stepsize and $\mathcal{T}$ is as in (27).

Some remarks about Proposition 2.4 are now in order. First, it follows from statement (a) that $\tilde g(\overline{W})\leq\tilde g(\underline{W})$. Second, recall that each ADAP-AIPP iteration may perform more than a single ADAP-FISTA call. Statement (b) implies that the total number of ADAP-FISTA calls performed is at most the total number of ADAP-AIPP iterations performed plus a logarithmic term.

2.3   HLR Method

The goal of this subsection is to describe the HLR method for solving problem (8). The formal description of the HLR method is presented below.

 

HLR Method

 

Input: a quintuple $(\bar Y_0, g, \bar\epsilon, \bar\rho, \lambda_0)\in\bar B_1^{s_0}\times\mathcal{C}^1(\Delta^n;L_g)\times\mathbb{R}^3_{++}$ for some $s_0\geq 1$.

Output: $\bar Y\in\bar B_1^s$ for some $s\geq 1$ such that $\bar Y\bar Y^T$ is an $\bar\epsilon$-optimal solution of (8).

  • 0.

set $\tilde Y_0 = \bar Y_0$, $s = s_0$, and $k = 1$.

  • 1.

call ADAP-AIPP with quadruple $(\tilde g, \lambda_0, \underline{W}, \bar\rho) = (g, \lambda_0, \tilde Y_{k-1}, \bar\rho)$ and let $(Y_k, \mathcal{R}_k)\in\bar B_1^s\times\mathbb{R}^{n\times s}$ denote its output pair $(\overline{W},\overline{R})$;

  • 2.

    compute

$\theta_k = \max\{-\lambda_{\min}(G_k),\, 0\}, \qquad y_k = \begin{cases} v_{\min}(G_k) & \text{if } \theta_k > 0,\\ 0 & \text{otherwise}, \end{cases}$ (29)

    where

    $G_k := \nabla g(Y_k Y_k^T)\in\mathbb{S}^n$ (30)

    and $(\lambda_{\min}(G_k), v_{\min}(G_k))\in\mathbb{R}\times\mathbb{R}^n$ is a minimum eigenpair of $G_k$;

  • 3.

    set

$\epsilon_k := (G_k Y_k)\bullet Y_k + \theta_k.$ (31)

    If

    $\epsilon_k\leq\bar\epsilon$ (32)

    then stop and output the pair $(\bar Y,\bar\theta) = (Y_k,\theta_k)$; else go to step 4;

  • 4.

    compute

$\alpha_k = \operatorname*{arg\,min}_\alpha\left\{g\left(\alpha y_k y_k^T + (1-\alpha)Y_k Y_k^T\right) : \alpha\in[0,1]\right\}$ (33)

    and set

    $(\tilde Y_k, s) = \begin{cases} (y_k,\, 1) & \text{if } \alpha_k = 1,\\ \left(\left[\sqrt{1-\alpha_k}\,Y_k,\ \sqrt{\alpha_k}\,y_k\right],\, s+1\right) & \text{otherwise}; \end{cases}$ (34)
  • 5.

set $k\leftarrow k+1$ and go to step 1.

 

Remarks about each of the steps in HLR are now given. First, step 1 of the $k$-th iteration of HLR calls ADAP-AIPP with initial point $\tilde Y_{k-1}\in\bar B_1^s$ to produce an iterate $Y_k\in\bar B_1^s$ that is a $\bar\rho$-approximate stationary solution of nonconvex problem (15). Second, step 2 performs a minimum eigenvector (MEV) computation to obtain the quantity $\theta_k$ in (29), which is needed for the termination check performed in step 3. Third, the definition of $\theta(\cdot)$ in (10) and relations (30), (29), and (31) imply that

$\theta_k = \theta(Y_k Y_k^T), \quad \epsilon_k = \nabla g(Y_k Y_k^T)\bullet(Y_k Y_k^T) + \theta(Y_k Y_k^T).$

Hence, it follows from Lemma 2.1(b) that termination criterion (32) in step 3 is equivalent to checking whether $Y_k Y_k^T$ is an $\bar\epsilon$-optimal solution of (8). Fourth, if the termination criterion in step 3 is not satisfied, a FW step at $Y_k Y_k^T$ for (8) is taken in step 4 to produce an iterate $\tilde Y_k\tilde Y_k^T$ as in (33) and (34). The computation of $\tilde Y_k$ is performed entirely in the $Y$-space to avoid forming matrices of the form $YY^T$, and hence to save storage space. The reason for performing this FW step is to make HLR escape from the spurious near stationary point $Y_k$ of (15) (see the end of the paragraph containing ($\mathrm{L}_r$) in the Introduction). Finally, the quantity $s$ in (34) keeps track of the column dimension of the most recently generated $\tilde Y_k$; it can either increase by one or be reset to one after the update (34) is performed. A sketch of these computations is given below.
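For illustration, the sketch below carries out steps 2-4 of HLR in the $Y$-space, computing the minimum eigenpair with an iterative eigensolver; `G_matvec` (the action $v\mapsto G_k v$, available without forming $G_k$) and `line_search` (returning $\alpha_k$ as in (33)) are placeholder callables:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def hlr_check_and_fw_step(Y, G_matvec, line_search, eps_bar):
    """Steps 2-4 of HLR in the Y-space: MEV-based optimality check
    (29)-(32) and, if it fails, the Frank-Wolfe update (34)."""
    n = Y.shape[0]
    G = LinearOperator((n, n), matvec=G_matvec, dtype=Y.dtype)
    lam, v = eigsh(G, k=1, which='SA')        # minimum eigenpair of G_k
    theta = max(-lam[0], 0.0)                 # theta_k as in (29)
    y = v[:, 0] if theta > 0 else np.zeros(n)
    GY = np.apply_along_axis(G_matvec, 0, Y)  # G_k Y, column by column
    eps_k = np.sum(GY * Y) + theta            # (G_k Y) . Y + theta_k, (31)
    if eps_k <= eps_bar:                      # termination check (32)
        return Y, theta, True
    alpha = line_search(Y, y)                 # exact stepsize (33)
    if alpha == 1.0:
        Y_new = y.reshape(-1, 1)              # column dimension resets to 1
    else:                                     # append one column, cf. (34)
        Y_new = np.hstack([np.sqrt(1 - alpha) * Y,
                           np.sqrt(alpha) * y.reshape(-1, 1)])
    return Y_new, theta, False
```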

The complexity of the HLR method is described in the result below, whose proof is given in the next subsection.

Theorem 2.5.

The following statements about the HLR method hold:

  • (a)

the HLR method outputs a point $\bar Y$ such that $\bar Z = \bar Y\bar Y^T$ is an $\bar\epsilon$-optimal solution of (8) in at most

    $\mathcal{S}(\bar\epsilon) := \left\lceil 1 + \frac{4\max\left\{g(\bar Z_0) - g_*,\ \sqrt{4L_g(g(\bar Z_0) - g_*)},\ 4L_g\right\}}{\bar\epsilon}\right\rceil$ (35)

    iterations (and hence MEV computations), where $\bar Z_0 = \bar Y_0\bar Y_0^T$, $L_g$ is as in (9), and $g_*$ is the optimal value of (8);

  • (b)

    the total number of ADAP-FISTA calls performed by the HLR method is no more than

$(1+\mathcal{Q})\,\mathcal{S}(\bar\epsilon) + \frac{C_\sigma\max\{8\bar G + 144L_g,\ 1/\lambda_0\}}{\bar\rho^2}\left[g(\bar Z_0) - g(\bar Z)\right]$ (36)

    where $\lambda_0$ is given as input to the HLR method, and $\bar G$, $C_\sigma$, and $\mathcal{Q}$ are as in (25), (26), and (50), respectively.

Two remarks about Theorem 2.5 are now given. First, it follows from statement (a) that HLR performs $\mathcal{O}(1/\bar\epsilon)$ iterations (and hence MEV computations). Second, statement (b) implies that the total number of ADAP-FISTA calls performed by HLR is $\mathcal{O}(1/\bar\epsilon + 1/\bar\rho^2)$, where $\bar\rho$ is the input tolerance for each ADAP-AIPP call in step 1 of HLR.

Recall from the Introduction that an iteration of the HALLaR method described in (2a) and (2b) has to approximately solve subproblem (2a), and that the HLR method specialized to the case where $g(\cdot) = \mathcal{L}_\beta(\cdot;p)$ is the novel tool proposed for solving it. In the following, we discuss how to specialize some of the steps of HLR to this case. Step 1 of HLR calls the ADAP-AIPP method, whose steps can be easily implemented if it is known how to project a matrix onto the unit ball $\bar B_1$ and how to compute $\nabla\tilde g(Y)$, where $\tilde g(Y)$ is as in (15). Projecting a matrix onto the unit ball is easy, as it just involves dividing the matrix by its norm whenever that norm exceeds one. Define the quantities

$q(Y;p) := p + \beta(\mathcal{A}(YY^T) - b), \quad \tilde\theta(Y;p) := \max\{-\lambda_{\min}[C + \mathcal{A}^*(q(Y;p))],\ 0\}.$ (37)

When $g(\cdot) = \mathcal{L}_\beta(\cdot;p)$, the matrix $\nabla\tilde g(Y)$ can be explicitly computed as

$\nabla\tilde g(Y) = 2\nabla g(YY^T)Y, \quad \nabla g(YY^T) = C + \mathcal{A}^*(q(Y;p)).$ (38)

Also, the stepsize $\alpha_k$ in step 4 of HLR has a closed-form expression given by

$\alpha_k = \min\left\{\frac{CY_k\bullet Y_k + (\mathcal{A}^*q_k)Y_k\bullet Y_k + \tilde\theta(Y_k;p)}{\beta\|\mathcal{A}(Y_k Y_k^T) - \mathcal{A}(y_k y_k^T)\|^2},\ 1\right\}$

where

$q_k := q(Y_k;p), \quad y_k := \begin{cases} v_{\min}\left(\nabla g(Y_k Y_k^T)\right) & \text{if } \tilde\theta(Y_k;p) > 0,\\ 0 & \text{otherwise}. \end{cases}$ (39)
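As an illustration, the formulas above can be evaluated entirely in the factored space. In the sketch below, `C_mul(M)` ($=CM$), `At_mul(q, M)` ($=(\mathcal{A}^*q)M$), and `A_of(Y)` ($=\mathcal{A}(YY^T)$) are placeholder callables meant to exploit the sparsity or structure of $C$ and $\mathcal{A}$:

```python
import numpy as np

def grad_gtilde(Y, p, beta, b, C_mul, At_mul, A_of):
    """Gradient (38) of g~(Y) = L_beta(YY^T; p), without forming YY^T."""
    q = p + beta * (A_of(Y) - b)             # q(Y; p) as in (37)
    return 2.0 * (C_mul(Y) + At_mul(q, Y))   # 2 (C + A* q) Y

def fw_stepsize(Y, y, p, beta, b, theta_t, C_mul, At_mul, A_of):
    """Closed-form Frank-Wolfe stepsize alpha_k displayed above."""
    q = p + beta * (A_of(Y) - b)
    num = np.sum((C_mul(Y) + At_mul(q, Y)) * Y) + theta_t
    den = beta * np.linalg.norm(A_of(Y) - A_of(y.reshape(-1, 1))) ** 2
    return min(num / den, 1.0)
```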

2.4   Proof of Theorem 2.5

This subsection provides the proof of Theorem 2.5. The following proposition establishes important properties of the iterates generated by HLR.

Proposition 2.6.

For every $k\geq 1$, define

$Z_k = Y_k Y_k^T, \quad Z_k^F = y_k y_k^T, \quad D_k := Z_k - Z_k^F, \quad \tilde Z_k = \tilde Y_k\tilde Y_k^T,$ (40)

where $y_k$ is as in (29). Then, for every $k\geq 1$, the following relations hold:

$Z_k^F\in\operatorname*{arg\,min}_U\{\ell_g(U;Z_k) : U\in\Delta^n\};$ (41)
$\tilde Z_k = Z_k - \alpha_k D_k;$ (42)
$\epsilon_k = G_k\bullet D_k;$ (43)
$\alpha_k = \operatorname*{arg\,min}_{\alpha\in[0,1]} g(Z_k - \alpha D_k);$ (44)
$g(Z_{k+1})\leq g(\tilde Z_k)\leq g(Z_k),$ (45)

where $\epsilon_k$ and $\alpha_k$ are as in (31) and (33), respectively. Moreover, for every $k\geq 1$, it holds that

$\theta_k\geq 0, \quad G_k + \theta_k I\succeq 0, \quad G_k\bullet Z_k + \theta_k = \epsilon_k.$ (46)
Proof.

Relation (41) follows immediately from the way $y_k$ is computed in (29), the definitions of $Z_k$ and $Z_k^F$ in (40), and Proposition A.1 with $G = G_k$ and $(Z^F,\theta^F) = (Z_k^F,\theta_k)$.

To show relation (43), it is first necessary to show that

$\theta_k = -G_k\bullet Z_k^F.$ (47)

Consider two possible cases. For the first case, suppose that $\theta_k = 0$, which, in view of the definitions of $y_k$ and $Z_k^F$ in (29) and (40), respectively, implies that $Z_k^F = 0$. Hence, relation (47) immediately follows. For the other case, suppose that $\theta_k > 0$ and hence $(\theta_k, y_k) = (-\lambda_{\min}(G_k), v_{\min}(G_k))$. This observation and the fact that $(\lambda_{\min}(G_k), v_{\min}(G_k))$ is an eigenpair of $G_k$ imply that $G_k y_k\bullet y_k = -\theta_k$, which, in view of the definition of $Z_k^F$ in (40), immediately implies relation (47).

It is now easy to see that the definition of $\epsilon_k$ in (31), relation (47), the update rule for $\tilde Y_k$ in (34), and the definitions of $Z_k$, $Z_k^F$, and $D_k$ in (40) imply that

$\tilde Z_k = Z_k - \alpha_k D_k,$
$\epsilon_k = G_k\bullet(Z_k - Z_k^F) = G_k\bullet D_k,$
$\alpha_k = \operatorname*{arg\,min}_\alpha\left\{g(Z_k - \alpha D_k) : \alpha\in[0,1]\right\},$

and hence relations (42), (43), and (44) follow.

Relation (27) and the fact that HLR during its $k$-th iteration calls ADAP-AIPP with initial point $\underline{W} = \tilde Y_{k-1}$ and outputs point $\overline{W} = Y_k$ imply that $\tilde g(Y_k)\leq\tilde g(\tilde Y_{k-1})$. This fact, together with the definition of $\tilde g$ in (15) and the definitions of $Z_k$ and $\tilde Z_k$ in (40), implies the first inequality in (45). Now, the first inequality in (45) and relations (41), (42), (43), and (44) imply that $Z_k = Y_k Y_k^T$ and $\tilde Z_k = \tilde Y_k\tilde Y_k^T$ can be viewed as iterates of the $k$-th iteration of the RFW method of Appendix D. The second inequality in (45) is then an immediate consequence of this observation and the second relation in (138).

The first two relations in (46) follow directly from the definitions of $G_k$ and $\theta_k$ in (30) and (29), respectively. The definitions of $\epsilon_k$ and $Z_k$ in (31) and (40), respectively, immediately imply the third relation in (46). ∎

Since step 1 of the HLR method consists of a call to the ADAP-AIPP method developed in Subsection 2.2, the conclusion of Proposition 2.4 applies to this step. The following result, which will be useful in the analysis of the HLR method, translates the conclusion of Proposition 2.4 to the current setting.

Proposition 2.7.

The following statements about step 1 of the $k$-th iteration of the HLR method hold:

  • (a)

the ADAP-AIPP call terminates with a pair $(Y_k,\mathcal{R}_k)$ satisfying

    $\mathcal{R}_k\in\nabla\tilde g(Y_k) + \partial\delta_{\bar B_1}(Y_k), \quad \|\mathcal{R}_k\|_F\leq\bar\rho,$ (48)

    and the number $l_k$ of ADAP-AIPP iterations performed by the ADAP-AIPP call in step 1 satisfies

    $1\leq l_k\leq\mathcal{T}_k := 1 + \frac{C_\sigma\max\{8\bar G + 144L_g,\ 1/\lambda_0\}}{\bar\rho^2}\left[\tilde g(\tilde Y_{k-1}) - \tilde g(Y_k)\right]$ (49)

    where $\lambda_0$ and $\bar\rho$ are given as input to the HLR method and $L_g$, $\tilde g$, $\bar G$, and $C_\sigma$ are as in (9), (15), (25), and (26), respectively;

  • (b)

the number of ADAP-FISTA calls performed by the ADAP-AIPP call in step 1 is no more than $\mathcal{T}_k + \mathcal{Q}$, where

    $\mathcal{Q} := \left\lceil\log_0^+\left(\lambda_0\max\{8\bar G + 144L_g,\ 1/\lambda_0\}\right)/\log 2\right\rceil.$ (50)
Proof.

(a) The first statement is immediate from relation (24) and the fact that ADAP-AIPP outputs the pair $(Y_k,\mathcal{R}_k) = (\overline{W},\overline{R})$. To prove the second statement, suppose that the ADAP-AIPP call made in step 1 terminates after performing $l_k$ iterations. It follows immediately from Proposition 2.4(a) and the fact that the HLR method during its $k$-th iteration calls ADAP-AIPP with initial point $\underline{W} = \tilde Y_{k-1}$ and outputs point $\overline{W} = Y_k$ that $l_k$ satisfies

$1\leq l_k\overset{(27)}{\leq} 1 + \frac{C_\sigma}{\underline{\lambda}\bar\rho^2}\left[\tilde g(\tilde Y_{k-1}) - \tilde g(Y_k)\right].$ (51)

The result then follows from the definitions of $\underline{\lambda}$ and $L_{\tilde g}$ in (26) and (25), respectively.

(b) The result follows immediately from (a), the fact that the number of times $\lambda$ is divided by $2$ in step 2 of ADAP-AIPP is at most $\lceil\log_0^+(\lambda_0/\underline{\lambda})/\log 2\rceil$, and the definitions of $\underline{\lambda}$ and $L_{\tilde g}$ in (26) and (25), respectively. ∎

We are now ready to give the proof of Theorem 2.5.

Proof of Theorem 2.5.

(a) Consider the matrix $\bar Z = \bar Y\bar Y^T$, where $\bar Y$ is the output of the HLR method. The definition of $G_k$ in (30), the fact that the definitions of $\theta_k$ and $\theta(\cdot)$ in (29) and (10), respectively, imply that $\theta_k = \theta(Y_k Y_k^T)$, relation (46), and the stopping criterion (32) in step 3 of the HLR method immediately imply that the pair $(\bar Z,\theta(\bar Z))$ satisfies relation (13) in Lemma 2.1(b) with $\epsilon = \bar\epsilon$, and hence $\bar Z$ is an $\bar\epsilon$-optimal solution of (8). To show that the number of iterations that the HLR method performs to find such an $\bar\epsilon$-optimal solution is at most the quantity in (35), observe that Proposition 2.6 establishes that the HLR method generates iterates $Y_k$ and $\tilde Y_k$ during its $k$-th iteration such that $Z_k = Y_k Y_k^T$ and $\tilde Z_k = \tilde Y_k\tilde Y_k^T$ can be viewed as iterates of the $k$-th iteration of the RFW method of Appendix D. The result then immediately follows from this observation, Theorem D.1, and the fact that the diameter of $\Delta^n$ is at most 2.

(b) Suppose that the HLR method terminates at iteration index $\bar K$. It follows from relations (50) and (35) and the definition of $\mathcal{T}_k$ in (49) that the HLR method performs at most

$(1+\mathcal{Q})\,\mathcal{S}(\bar\epsilon) + \frac{C_\sigma\max\{8\bar G + 144L_g,\ 1/\lambda_0\}}{\bar\rho^2}\sum_{k=1}^{\bar K}\left[\tilde g(\tilde Y_{k-1}) - \tilde g(Y_k)\right]$ (52)

ADAP-FISTA calls. Now, using the fact that $\tilde g(\tilde Y_k)\leq\tilde g(Y_k)$, it is easy to see that the last sum in (52) telescopes and is bounded:

$\sum_{k=1}^{\bar K}\left[\tilde g(\tilde Y_{k-1}) - \tilde g(Y_k)\right] = \tilde g(\tilde Y_{\bar K-1}) - \tilde g(Y_{\bar K}) + \sum_{k=1}^{\bar K-1}\left[\tilde g(\tilde Y_{k-1}) - \tilde g(Y_k)\right] \leq \tilde g(\tilde Y_{\bar K-1}) - \tilde g(Y_{\bar K}) + \sum_{k=1}^{\bar K-1}\left[\tilde g(\tilde Y_{k-1}) - \tilde g(\tilde Y_k)\right] = \tilde g(\tilde Y_0) - \tilde g(Y_{\bar K}).$ (53)

The result then follows from relations (52) and (53), the facts that $Y_{\bar K} = \bar Y$, $\bar Z = \bar Y\bar Y^T$, $\tilde Y_0 = \bar Y_0$, and $\bar Z_0 = \bar Y_0\bar Y_0^T$, and the definition of $\tilde g$ in (15). ∎

3   HALLaR

This section presents an inexact AL method for solving the pair of primal-dual SDPs (P) and (D), namely, HALLaR, whose outline is given in the introduction. It contains two subsections. Subsection 3.1 formally states HALLaR and presents its main complexity result, namely Theorem 3.2. Subsection 3.2 is devoted to the proof of Theorem 3.2.

Throughout this section, it is assumed that (P) and (D) have optimal solutions $X_*$ and $(p_*,\theta_*)$, respectively, and that both (P) and (D) have the same optimal value. It is well known that this assumption is equivalent to the existence of a triple $(X_*,p_*,\theta_*)$ satisfying the optimality conditions:

(primal feasibility) $\quad \mathcal{A}(X_*) - b = 0, \quad \operatorname{tr}(X_*)\leq 1, \quad X_*\succeq 0,$
(dual feasibility) $\quad S_* := C + \mathcal{A}^*p_* + \theta_* I\succeq 0, \quad \theta_*\geq 0,$ (54)
(complementarity) $\quad \langle X_*, S_*\rangle = 0, \quad \theta_*(1 - \operatorname{tr}X_*) = 0.$

This section studies the complexity of HALLaR for finding an $(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}})$-solution of (54), i.e., a triple $(\bar X,\bar p,\bar\theta)$ that satisfies

($\epsilon_{\mathrm{p}}$-primal feasibility) $\quad \|\mathcal{A}(\bar X) - b\|\leq\epsilon_{\mathrm{p}}, \quad \operatorname{tr}(\bar X)\leq 1, \quad \bar X\succeq 0,$
(dual feasibility) $\quad \bar S := C + \mathcal{A}^*\bar p + \bar\theta I\succeq 0, \quad \bar\theta\geq 0,$ (55)
($\epsilon_{\mathrm{c}}$-complementarity) $\quad \langle\bar X,\bar S\rangle + \bar\theta(1 - \operatorname{tr}\bar X)\leq\epsilon_{\mathrm{c}}.$

3.1   Description of HALLaR and Main Theorem

The formal description of HALLaR is presented next.

 

HALLaR Method

 

Input: initial points $(U_0, p_0)\in\bar B_1^{s_0}\times\mathbb{R}^m$, tolerance pair $(\epsilon_{\mathrm{c}},\epsilon_{\mathrm{p}})\in\mathbb{R}^2_{++}$, penalty parameter $\beta\in\mathbb{R}_{++}$, and ADAP-AIPP parameters $(\bar\rho,\lambda_0)\in\mathbb{R}^2_{++}$.

Output: $(\bar X,\bar p,\bar\theta)\in\Delta^n\times\mathbb{R}^m\times\mathbb{R}_+$, an $(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}})$-solution of (54).

  • 0.

set $t = 1$ and

    $\bar\epsilon = \min\{\epsilon_{\mathrm{c}},\ \epsilon_{\mathrm{p}}^2\beta/6\};$ (56)
  • 1.

call HLR with input $(g,\bar Y_0,\lambda_0,\bar\epsilon,\bar\rho) = (\mathcal{L}_\beta(\cdot;p_{t-1}), U_{t-1}, \lambda_0, \bar\epsilon, \bar\rho)$, where $U_{t-1}\in\bar B_1^{s_{t-1}}$, and let $U_t\in\bar B_1^{s_t}$ denote its output $\bar Y$;

  • 2.

    set

$p_t = p_{t-1} + \beta(\mathcal{A}(U_t U_t^T) - b);$ (57)
  • 3.

if $\|\mathcal{A}(U_t U_t^T) - b\|\leq\epsilon_{\mathrm{p}}$, then set $T = t$ and return $(\bar X,\bar p,\bar\theta) = (U_T U_T^T,\ p_T,\ \tilde\theta(U_T U_T^T; p_{T-1}))$, where $\tilde\theta(\cdot;\cdot)$ is as in (37);

  • 4.

set $t = t+1$ and go to step 1.

 

Some remarks about each of the steps in HALLaR are now given. First, step 1 invokes HLR to obtain an $\bar\epsilon$-optimal solution $U_t U_t^T$ of subproblem (2a), using the previous iterate $U_{t-1}\in\bar B_1^{s_{t-1}}$ as initial point. Second, step 2 updates the multiplier $p_t$ according to a full Lagrange multiplier update. Third, it is shown in Lemma 3.1 below that the triple $(U_t U_t^T,\ p_t,\ \tilde\theta(U_t U_t^T; p_{t-1}))$ always satisfies the dual feasibility and $\epsilon_{\mathrm{c}}$-complementarity conditions in (55), where $\tilde\theta(\cdot;\cdot)$ is as in (37). Finally, step 3 checks whether $U_t U_t^T$ is an $\epsilon_{\mathrm{p}}$-primal feasible solution. It then follows from the above remarks that if this condition is satisfied, then the triple $(U_t U_t^T,\ p_t,\ \tilde\theta(U_t U_t^T; p_{t-1}))$ is an $(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}})$-solution of (54).

Lemma 3.1.

For every iteration index $t$, the triple $(U_t U_t^T,\ p_t,\ \tilde\theta(U_t U_t^T; p_{t-1}))$ satisfies the dual feasibility and $\epsilon_{\mathrm{c}}$-complementarity conditions in (55), where $\tilde\theta(\cdot;\cdot)$ is as in (37).

Proof.

The definitions of $q(\cdot;\cdot)$ and $\tilde\theta(\cdot;\cdot)$ in (37), the second relation in (38) with $Y = U_t$ and $p = p_{t-1}$, and the update rule for $p_t$ in (57) imply that

$\nabla\mathcal{L}_\beta(U_t U_t^T; p_{t-1}) = C + \mathcal{A}^*p_t, \quad \tilde\theta(U_t U_t^T; p_{t-1}) = \max\{-\lambda_{\min}(C + \mathcal{A}^*p_t),\ 0\}.$

It is then easy to see from the above relation that the triple $(U_t U_t^T,\ p_t,\ \tilde\theta(U_t U_t^T; p_{t-1}))$ always satisfies the dual feasibility condition in (55).

The fact that the definition of $\bar\epsilon$ in (56) implies that $\bar\epsilon\leq\epsilon_{\mathrm{c}}$, the fact that $U_t U_t^T$ is an $\bar\epsilon$-optimal solution of (2a), and the formula for $\nabla\mathcal{L}_\beta(U_t U_t^T; p_{t-1})$ above imply that

$0\in C + \mathcal{A}^*p_t + \partial_{\epsilon_{\mathrm{c}}}\delta_{\Delta^n}(U_t U_t^T).$

It then follows immediately from the above inclusion and relation (14) with $Z = U_t U_t^T$, $g = \mathcal{L}_\beta(\cdot;p_{t-1})$, and $\theta(\cdot) = \tilde\theta(\cdot;p_{t-1})$ that the triple $(U_t U_t^T,\ p_t,\ \tilde\theta(U_t U_t^T; p_{t-1}))$ always satisfies the $\epsilon_{\mathrm{c}}$-complementarity condition in (55). ∎

Before stating the complexity of HALLaR, the following quantities are first introduced:

$\bar{\mathcal{F}}_\beta := \frac{\beta}{2}\|\mathcal{A}X_0 - b\|^2 + \frac{5\|p_*\|^2 + \|p_*\|\sqrt{3\|p_*\|^2 + 2\beta\bar\epsilon}}{\beta} + 3\bar\epsilon$ (58)
$\bar{\mathcal{G}}_\beta := \|C\|_F + \|\mathcal{A}\|\left(9\beta\|\mathcal{A}\| + \sqrt{4\|p_*\|^2 + 2\beta\bar\epsilon} + \beta\|b\|\right)$ (59)
$\bar\kappa_\beta := \left\lceil\log_0^+\left(\lambda_0\max\{8\bar{\mathcal{G}}_\beta + 144\beta\|\mathcal{A}\|^2,\ 1/\lambda_0\}\right)/\log 2\right\rceil$ (60)

where $p_*$ is an optimal dual solution and $\bar\epsilon$ is as in (56).

The main result of this paper is now stated.

Theorem 3.2.

The following statements hold:

  • a)

HALLaR terminates with an $(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}})$-solution $(\bar X,\bar p,\bar\theta)$ of the pair of SDPs (P) and (D) in at most

    $\mathcal{J} := \left\lceil\frac{3\|p_*\|^2}{\beta^2\epsilon_{\mathrm{p}}^2}\right\rceil$ (61)

    iterations, where $p_*$ is an optimal solution of (D);

  • (b)

    HALLaR performs at most

$\mathcal{J}\cdot\bar{\mathcal{P}}_\beta(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}})$ (62)

    and

    $\mathcal{J}\left[(1+\bar\kappa_\beta)\,\bar{\mathcal{P}}_\beta(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}})\right] + \frac{C_\sigma\bar{\mathcal{F}}_\beta\max\{8\bar{\mathcal{G}}_\beta + 144\beta\|\mathcal{A}\|^2,\ 1/\lambda_0\}}{\bar\rho^2}$ (63)

    total HLR iterations (and hence MEV computations) and total ADAP-FISTA calls, respectively, where

    $\bar{\mathcal{P}}_\beta(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}}) := \left\lceil 1 + \frac{4\max\left\{\bar{\mathcal{F}}_\beta,\ \sqrt{4\beta\|\mathcal{A}\|^2\bar{\mathcal{F}}_\beta},\ 4\beta\|\mathcal{A}\|^2\right\}}{\min\{\epsilon_{\mathrm{c}},\ \epsilon_{\mathrm{p}}^2\beta/6\}}\right\rceil,$ (64)

    $\bar\rho$ is an input parameter to HALLaR, and $\bar{\mathcal{F}}_\beta$, $C_\sigma$, $\bar{\mathcal{G}}_\beta$, and $\bar\kappa_\beta$ are as in (58), (26), (59), and (60), respectively.

Two remarks about Theorem 3.2 are now given. First, it follows from statement (a) that HALLaR performs $\mathcal{O}(1/(\beta^2\epsilon_{\mathrm{p}}^2))$ iterations. Second, statement (b) and the definitions of $\mathcal{J}$, $\bar{\mathcal{P}}_\beta(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}})$, and $\bar{\mathcal{F}}_\beta$ in (61), (64), and (58), respectively, imply that HALLaR performs

$\mathcal{O}\left(\frac{1}{\beta\epsilon_{\mathrm{p}}^2\min\{\epsilon_{\mathrm{c}},\ \epsilon_{\mathrm{p}}^2\beta\}}\right)$ (65)

and

$\mathcal{O}\left(\frac{1}{\beta\epsilon_{\mathrm{p}}^2\min\{\epsilon_{\mathrm{c}},\ \epsilon_{\mathrm{p}}^2\beta\}} + \frac{\beta^2}{\bar\rho^2}\right)$ (66)

total HLR iterations (and hence MEV computations) and ADAP-FISTA calls, respectively. In contrast to the case where $\beta = \mathcal{O}(1)$, the result below shows that the bounds (65) and (66) can be improved when $\beta = \mathcal{O}(1/\epsilon_{\mathrm{p}})$.

Corollary 3.3.

If $\beta = \mathcal{O}(1/\epsilon_{\mathrm{p}})$ and $\bar\rho = \beta\min\{\epsilon_{\mathrm{c}},\ \epsilon_{\mathrm{p}}^2\beta/6\}$, then HALLaR performs at most

$\mathcal{O}\left(\frac{1}{\epsilon_{\mathrm{p}}^2} + \frac{1}{\epsilon_{\mathrm{c}}^2}\right)$ (67)

total HLR iterations (and hence MEV computations) and total ADAP-FISTA calls.

Proof.

The conclusion of the corollary follows immediately from relations (65) and (66) together with the assumptions that $\beta = \mathcal{O}(1/\epsilon_{\mathrm{p}})$ and $\bar\rho = \beta\min\{\epsilon_{\mathrm{c}},\ \epsilon_{\mathrm{p}}^2\beta/6\}$. ∎

It can be shown that each ADAP-FISTA call performs at most

$\mathcal{O}_1\left(\sqrt{2\left[1 + \lambda_0(2\bar{\mathcal{G}}_\beta + 36\beta\|\mathcal{A}\|^2)\right]}\,\log_1^+\left(1 + \lambda_0\left[2\bar{\mathcal{G}}_\beta + 36\beta\|\mathcal{A}\|^2\right]\right)\right)$ (68)

iterations/resolvent evaluations, where a resolvent evaluation of $h$ is an evaluation of $(I+\gamma\partial h)^{-1}(\cdot)$ for some $\gamma>0$. This is an immediate consequence of Lemma C.2(a) in Appendix C, the definition of $L_{\tilde g}$ in (25), the fact that $\mathcal{L}_\beta(X;p_{t-1})$ is $(\beta\|\mathcal{A}\|^2)$-smooth, and Lemma 3.7, which is developed in the next subsection.

3.2   Proof of Theorem 3.2

Since HALLaR calls the HLR method at every iteration, the next proposition specializes Theorem 2.5, which states the complexity of HLR, to the specific case of SDPs. Its statement uses the following quantities associated with an iteration $t$ of HALLaR:

$\mathcal{F}^{(t)}_\beta := \mathcal{L}_\beta(X_{t-1}, p_{t-1}) - \min_{X\in\Delta^n}\mathcal{L}_\beta(X, p_{t-1}),$ (69)
$\mathcal{G}^{(t)}_\beta := \sup\{\|\nabla\mathcal{L}_\beta(UU^T, p_{t-1})\| : U\in\bar B_3^{s_{t-1}}\},$ (70)
$\kappa^{(t)}_\beta := \left\lceil\log_0^+\left(\lambda_0\max\{8\mathcal{G}^{(t)}_\beta + 144\beta\|\mathcal{A}\|^2,\ 1/\lambda_0\}\right)/\log 2\right\rceil.$ (71)
Proposition 3.4.

The following statements about the HLR call in step 1 of the $t$-th iteration of HALLaR hold:

  • (a)

it outputs $U_t$ such that $X_t = U_t U_t^T$ is an $\bar\epsilon$-optimal solution of

    $\min_{X\in\Delta^n}\mathcal{L}_\beta(X; p_{t-1})$

    by performing at most

    $\mathcal{P}^{(t)}_\beta(\bar\epsilon) := \left\lceil 1 + \frac{4\max\left\{\mathcal{F}^{(t)}_\beta,\ \sqrt{4\beta\|\mathcal{A}\|^2\mathcal{F}^{(t)}_\beta},\ 4\beta\|\mathcal{A}\|^2\right\}}{\bar\epsilon}\right\rceil$ (72)

    iterations (and hence MEV computations), where $\mathcal{F}^{(t)}_\beta$ and $\bar\epsilon$ are as in (69) and (56), respectively;

  • (b)

    the total number of ADAP-FISTA calls within such call is no more than

    (1+κβ(t))𝒫β(t)(ϵ¯)+Cσmax{8𝒢β(t)+144β𝒜2,1/λ0}ρ¯2[β(Xt1,pt1)β(Xt,pt1)]\displaystyle(1+\kappa_{\beta}^{(t)}){\mathcal{P}^{(t)}_{\beta}}(\bar{\epsilon})+\frac{C_{\sigma}\max\{8{\mathcal{G}}_{\beta}^{(t)}+144\beta\|\mathcal{A}\|^{2},1/\lambda_{0}\}}{\bar{\rho}^{2}}\left[\mathcal{L}_{\beta}(X_{t-1},p_{t-1})-\mathcal{L}_{\beta}(X_{t},p_{t-1})\right] (73)

    where λ0\lambda_{0} is given as input to HLR, and CσC_{\sigma}, 𝒢β(t){\mathcal{G}}_{\beta}^{(t)}, and κβ(t)\kappa_{\beta}^{(t)} are as in (26), (70), and (71), respectively.

Proof.

(a) Recall that each HLR call during the tt-th iteration of HALLaR is made with (g,Y¯0,λ0,ϵ¯,ρ¯)=(β(;pt1),Ut1,λ0,ϵ¯,ρ¯)(g,\bar{Y}_{0},\lambda_{0},\bar{\epsilon},\bar{\rho})=(\mathcal{L}_{\beta}(\cdot;p_{t-1}),U_{t-1},\lambda_{0},\bar{\epsilon},\bar{\rho}). The result then immediately follows from Theorem 2.5(a), the definition of β(t)\mathcal{F}^{(t)}_{\beta} in (69), and the fact that β(X;pt1)\mathcal{L}_{\beta}(X;p_{t-1}) is (β𝒜2)(\beta\|\mathcal{A}\|^{2})-smooth.

(b) The proof follows directly from Theorem 2.5(b), statement (a), the fact that ℒ_β(·;p_{t-1}) is (β‖𝒜‖²)-smooth, and the definitions of 𝒢_β^{(t)} and κ_β^{(t)} in (70) and (71), respectively. ∎

The following lemma establishes key bounds which will be used later to bound the quantities ℱ_β^{(t)}, 𝒢_β^{(t)}, and κ_β^{(t)} appearing in Proposition 3.4.

Lemma 3.5.

The following relations hold:

maxt{0,T}pt\displaystyle\max_{t\in\{0,\ldots T\}}\|p_{t}\| p+3p2+2βϵ¯,\displaystyle\leq\|p_{*}\|+\sqrt{3\|p_{*}\|^{2}+2\beta\bar{\epsilon}}, (74)
βt=1T𝒜Xtb2\displaystyle\beta\sum_{t=1}^{T}\|\mathcal{A}X_{t}-b\|^{2} 3p2β+2ϵ¯,\displaystyle\leq\frac{3\|p_{*}\|^{2}}{\beta}+2\bar{\epsilon}, (75)

where TT is the last iteration index of HALLaR, pp_{*} is an optimal Lagrange multiplier, and ϵ¯\bar{\epsilon} is as in (56).

Proof.

It follows immediately from Proposition 3.4(a) and the definition of ϵ¯\bar{\epsilon}-optimal solution that the HLR call made in step 1 of the tt-th iteration of HALLaR outputs UtU_{t} such that Xt=UtUtTX_{t}=U_{t}U_{t}^{T} satisfies

0 ∈ ∇ℒ_β(X_t;p_{t-1}) + ∂_{ϵ̄}δ_{Δⁿ}(X_t). (76)

It is then easy to see that HALLaR is an instance of the AL Framework in Appendix E since the HLR method implements relation (152) in the Blackbox AL with (ϵ^c,ϵ^d)=(ϵ¯,0)(\hat{\epsilon}_{\mathrm{c}},\hat{\epsilon}_{\mathrm{d}})=(\bar{\epsilon},0). The proof of relations (74) and (75) now follows immediately from relations (157) and (158) and the fact that p0=0p_{0}=0. ∎

Lemma 3.6.

For every iteration index tt of HALLaR, the following relations hold:

β(t)¯β\displaystyle\mathcal{F}^{(t)}_{\beta}\leq\bar{\mathcal{F}}_{\beta} (77)
l=1t[β(Xl1,pl1)β(Xl,pl1)]¯β\displaystyle\sum_{l=1}^{t}\left[\mathcal{L}_{\beta}(X_{l-1},p_{l-1})-\mathcal{L}_{\beta}(X_{l},p_{l-1})\right]\leq\bar{\mathcal{F}}_{\beta} (78)

where Xt=UtUtTX_{t}=U_{t}U^{T}_{t} and β(t)\mathcal{F}^{(t)}_{\beta} and ¯β\bar{\mathcal{F}}_{\beta} are as in (69) and (58), respectively.

Proof.

Let tt be an iteration index. We first show that

β(Xt1,pt1)λmin(C)+β2𝒜X0b2+3p2β+2ϵ¯.\mathcal{L}_{\beta}(X_{t-1},p_{t-1})\leq\lambda_{\min}(C)+\frac{\beta}{2}\|\mathcal{A}X_{0}-b\|^{2}+\frac{3\|p_{*}\|^{2}}{\beta}+2\bar{\epsilon}. (79)

holds. It follows immediately from the definition of β(X;p)\mathcal{L}_{\beta}(X;p) in (3), the fact that X0=argminXΔnCXX_{0}=\operatorname*{arg\,min}_{X\in\Delta^{n}}C\bullet X, the Cauchy-Schwarz inequality, and the fact that p0=0p_{0}=0, that

β(X0,p0)\displaystyle\mathcal{L}_{\beta}(X_{0},p_{0}) (3)CX0+β2𝒜(X0)b2=λmin(C)+β2𝒜(X0)b2.\displaystyle\overset{\eqref{AL function}}{\leq}C\bullet X_{0}+\frac{\beta}{2}\|\mathcal{A}(X_{0})-b\|^{2}=\lambda_{\min}(C)+\frac{\beta}{2}\|\mathcal{A}(X_{0})-b\|^{2}. (80)

Hence, relation (79) holds with t=1. Suppose now that t ≥ 2 and let l be an iteration index such that l < t. Relation (45), the fact that HALLaR during its l-th iteration calls the HLR method with input g = ℒ_β(·,p_{l-1}) and initial point Ȳ₀ = U_{l-1}, and the fact that X_l = U_l U_l^T imply that ℒ_β(X_l,p_{l-1}) ≤ ℒ_β(X_{l-1},p_{l-1}) and hence that

β(Xl,pl)β(Xl1,pl1)β(Xl,pl)β(Xl,pl1)=(57),(3)β𝒜(Xl)b2\displaystyle\mathcal{L}_{\beta}(X_{l},p_{l})-\mathcal{L}_{\beta}(X_{l-1},p_{l-1})\leq\mathcal{L}_{\beta}(X_{l},p_{l})-\mathcal{L}_{\beta}(X_{l},p_{l-1})\overset{\eqref{dual update SDP},\eqref{AL function}}{=}\beta\|\mathcal{A}(X_{l})-b\|^{2} (81)

in view of the update rule (57) and the definition of β(;)\mathcal{L}_{\beta}(\cdot;\cdot) in (3). Summing relation (81) from l=1l=1 to t1t-1 and using relation (75) gives

β(Xt1,pt1)β(X0,p0)(81)βl=1t1𝒜(Xl)b2(75)3p2β+2ϵ¯.\displaystyle\mathcal{L}_{\beta}(X_{t-1},p_{t-1})-\mathcal{L}_{\beta}(X_{0},p_{0})\overset{\eqref{successive lagrangian-2}}{\leq}\beta\sum_{l=1}^{t-1}\|\mathcal{A}(X_{l})-b\|^{2}\overset{\eqref{bound on feasibility-SDP}}{\leq}\frac{3\|p_{*}\|^{2}}{\beta}+2\bar{\epsilon}. (82)

Relation (79) now follows by combining relations (80) and (82).

Now, relation (74), the fact that minXΔnCX=λmin(C)\min_{X\in\Delta^{n}}C\bullet X=\lambda_{\min}(C), and the definition of β(;)\mathcal{L}_{\beta}(\cdot;\cdot) in (3), imply that for any t=0,,Tt=0,\ldots,T,

β(X,pt)λmin(C)\displaystyle\mathcal{L}_{\beta}(X,p_{t})-\lambda_{\min}(C) β(X,pt)CX=12ptβ+β(AXb)2pt22β\displaystyle\geq\mathcal{L}_{\beta}(X,p_{t})-C\bullet X=\frac{1}{2}\left\|\frac{p_{t}}{\sqrt{\beta}}+\sqrt{\beta}(AX-b)\right\|^{2}-\frac{\|p_{t}\|^{2}}{2\beta}
(74)2p2+βϵ¯+p3p2+2βϵ¯βXΔn\displaystyle\overset{\eqref{bound on Lagrange multipliers-SDP}}{\geq}-\frac{2\|p_{*}\|^{2}+\beta\bar{\epsilon}+\|p_{*}\|\sqrt{3\|p_{*}\|^{2}+2\beta\bar{\epsilon}}}{\beta}\qquad\forall X\in\Delta^{n} (83)

where TT is the last iteration index of HALLaR. Relations (79) and (83) together with the definition of ¯β\bar{\mathcal{F}}_{\beta} in (58) then imply that β(Xt1,pt1)β(X,pt1)¯β\mathcal{L}_{\beta}(X_{t-1},p_{t-1})-\mathcal{L}_{\beta}(X,p_{t-1})\leq\bar{\mathcal{F}}_{\beta} for every XΔnX\in\Delta^{n} and iteration index tt. Relation (77) then follows immediately from this conclusion and the definition of β(t)\mathcal{F}^{(t)}_{\beta} in (69).

To show relation (78), observe that relations (81) and (75) imply that for any iteration index tt the following relations hold:

l=1tβ(Xl1,pl1)β(Xl,pl1)\displaystyle\sum_{l=1}^{t}\mathcal{L}_{\beta}(X_{l-1},p_{l-1})-\mathcal{L}_{\beta}(X_{l},p_{l-1})
=l=1t[β(Xl1,pl1)β(Xl,pl)]+l=1t[β(Xl,pl)β(Xl,pl1)]\displaystyle=\sum_{l=1}^{t}\left[\mathcal{L}_{\beta}(X_{l-1},p_{l-1})-\mathcal{L}_{\beta}(X_{l},p_{l})\right]+\sum_{l=1}^{t}\left[\mathcal{L}_{\beta}(X_{l},p_{l})-\mathcal{L}_{\beta}(X_{l},p_{l-1})\right]
=(81)β(X0,p0)β(Xt,pt)+βl=1t𝒜(Xl)b2(75)β(X0,p0)β(Xt,pt)+3p2β+2ϵ¯.\displaystyle\overset{\eqref{successive lagrangian-2}}{=}\mathcal{L}_{\beta}(X_{0},p_{0})-\mathcal{L}_{\beta}(X_{t},p_{t})+\beta\sum_{l=1}^{t}\|\mathcal{A}(X_{l})-b\|^{2}\overset{\eqref{bound on feasibility-SDP}}{\leq}\mathcal{L}_{\beta}(X_{0},p_{0})-\mathcal{L}_{\beta}(X_{t},p_{t})+\frac{3\|p_{*}\|^{2}}{\beta}+2\bar{\epsilon}.

Relation (78) then follows immediately from the above relation, relation (83) with X=XtX=X_{t}, relation (80), and the definition of ¯β\bar{\mathcal{F}}_{\beta} in (58). ∎

Lemma 3.7.

For every iteration tt of HALLaR, we have 𝒢β(t)𝒢¯β{\mathcal{G}}_{\beta}^{(t)}\leq\bar{\mathcal{G}}_{\beta} where 𝒢β(t){\mathcal{G}}_{\beta}^{(t)} and 𝒢¯β\bar{\mathcal{G}}_{\beta} are as in (70) and (59), respectively.

Proof.

Let tt be an iteration index of HALLaR and suppose that UB¯3st1U\in\bar{B}^{s_{t-1}}_{3}. It is easy to see from the definition of β(;)\mathcal{L}_{\beta}(\cdot;\cdot) in (3) that

β(UUT;pt1)=C+𝒜pt1+β𝒜(𝒜(UUT)b).\displaystyle\nabla\mathcal{L}_{\beta}(UU^{T};p_{t-1})=C+\mathcal{A}^{*}p_{t-1}+\beta\mathcal{A}^{*}(\mathcal{A}(UU^{T})-b). (84)

It then follows from the fact that UB¯3st1U\in\bar{B}^{s_{t-1}}_{3}, Cauchy-Schwarz inequality, triangle inequality, relation (84), and bound (74) that

β(UUT;pt1)F\displaystyle\|\nabla\mathcal{L}_{\beta}(UU^{T};p_{t-1})\|_{F} =(84)C+𝒜pt1+β𝒜(𝒜(UUT)b)F\displaystyle\overset{\eqref{def of gradient Lagrangian}}{=}\|C+\mathcal{A}^{*}p_{t-1}+\beta\mathcal{A}^{*}(\mathcal{A}(UU^{T})-b)\|_{F}
CF+𝒜pt1+β𝒜2UUTF+β𝒜b\displaystyle\leq\|C\|_{F}+\|\mathcal{A}\|\|p_{t-1}\|+\beta\|\mathcal{A}\|^{2}\|UU^{T}\|_{F}+\beta\|\mathcal{A}\|\,\|b\|
CF+𝒜pt1+9β𝒜2+β𝒜b\displaystyle\leq\|C\|_{F}+\|\mathcal{A}\|\|p_{t-1}\|+9\beta\|\mathcal{A}\|^{2}+\beta\|\mathcal{A}\|\,\|b\|
(74)CF+𝒜(9β𝒜+p+3p2+2βϵ¯+βb)\displaystyle\overset{\eqref{bound on Lagrange multipliers-SDP}}{\leq}\|C\|_{F}+\|\mathcal{A}\|\left(9\beta\|\mathcal{A}\|+\|p_{*}\|+\sqrt{3\|p_{*}\|^{2}+2\beta\bar{\epsilon}}+\beta\|b\|\right)

which immediately implies the result of the lemma in view of the definitions of 𝒢β(t){\mathcal{G}}_{\beta}^{(t)} and 𝒢¯β\bar{\mathcal{G}}_{\beta} in (70) and (59), respectively. ∎

We are now ready to prove Theorem 3.2.

Proof of Theorem 3.2.

(a) It follows immediately from Lemma 3.1 that the output (X̄,p̄,θ̄) satisfies the dual feasibility and ϵ_c-complementarity conditions in (55). It then remains to show that the triple (X̄,p̄,θ̄) satisfies the ϵ_p-primal feasibility condition in (55) in at most 𝒥 iterations, where 𝒥 is as in (61). To show this, it suffices to show that HALLaR is an instance of the AL framework analyzed in Appendix E. Observe first that (P) is a special case of (148) with f(X) = C•X and h(X) = δ_{Δⁿ}(X). It is also easy to see that the call to HLR in step 1 of HALLaR is a special way of implementing step 1 of the AL framework, i.e., the Blackbox AL. Indeed, Proposition 3.4(a) implies that the output U_t of HLR satisfies 0 ∈ ∇ℒ_β(U_tU_t^T;p_{t-1}) + ∂_{ϵ̄}δ_{Δⁿ}(U_tU_t^T) and hence HLR implements relation (152) in the Blackbox AL with (X̂,R̂) = (U_tU_t^T,0), (ϵ̂_c,ϵ̂_d) = (ϵ̄,0), g = ℒ_β(·;p_{t-1}), and h = δ_{Δⁿ}(·).

In view of the facts that HALLaR is an instance of the AL framework and p0=0p_{0}=0, it then follows immediately from Theorem E.1 that HALLaR terminates within the number of iterations in (61) and that the output (X¯,p¯,θ¯)(\bar{X},\bar{p},\bar{\theta}) satisfies the ϵp\epsilon_{\mathrm{p}}-primal feasibility condition in (55).

(b) Consider the quantity 𝒫β(t)(ϵ¯)\mathcal{P}^{(t)}_{\beta}(\bar{\epsilon}) as in (72) where tt is an iteration index of HALLaR. It follows immediately from relations (56) and (77) and the definition of 𝒫¯β(ϵp,ϵc)\mathcal{\bar{P}}_{\beta}(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}}) in (64) that 𝒫β(t)(ϵ¯)𝒫¯β(ϵp,ϵc)\mathcal{P}^{(t)}_{\beta}(\bar{\epsilon})\leq\mathcal{\bar{P}}_{\beta}(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}}). Hence, it follows from Proposition 3.4(a) that each HLR call made in step 1 of HALLaR performs at most 𝒫¯β(ϵp,ϵc)\mathcal{\bar{P}}_{\beta}(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}}) iterations/MEV computations. The result then follows from this conclusion and part (a).

(c) Let tt be an iteration index of HALLaR. Lemma 3.7 implies that 𝒢β(t)𝒢¯β{\mathcal{G}}_{\beta}^{(t)}\leq\bar{\mathcal{G}}_{\beta} and κβ(t)κ¯β\kappa^{(t)}_{\beta}\leq{\bar{\kappa}_{\beta}} in view of the definitions of κβ(t)\kappa^{(t)}_{\beta} and κ¯β{\bar{\kappa}_{\beta}} in (71) and (60), respectively. Hence, it follows from this conclusion, the fact that 𝒫β(t)(ϵ¯)𝒫¯β(ϵp,ϵc)\mathcal{P}^{(t)}_{\beta}(\bar{\epsilon})\leq\mathcal{\bar{P}}_{\beta}(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}}), Proposition 3.4(b), and part (a) that the total number of ADAP-FISTA calls performed by HALLaR is at most

\mathcal{J}(1+\bar{\kappa}_{\beta})\,\mathcal{\bar{P}}_{\beta}(\epsilon_{\mathrm{p}},\epsilon_{\mathrm{c}})+\frac{C_{\sigma}\max\{8\bar{\mathcal{G}}_{\beta}+144\beta\|\mathcal{A}\|^{2},1/\lambda_{0}\}}{\bar{\rho}^{2}}\sum_{t=1}^{T}\left[\mathcal{L}_{\beta}(X_{t-1},p_{t-1})-\mathcal{L}_{\beta}(X_{t},p_{t-1})\right]

where 𝒥\mathcal{J} is as in (61) and TT is the last iteration index of HALLaR. The result in (c) then follows immediately from the above relation together with the fact that relation (78) implies that t=1T[β(Xt1,pt1)β(Xt,pt1)]¯β\sum_{t=1}^{T}\left[\mathcal{L}_{\beta}(X_{t-1},p_{t-1})-\mathcal{L}_{\beta}(X_{t},p_{t-1})\right]\leq\bar{\mathcal{F}}_{\beta}. ∎

4   Computational experiments

In this section the performance of HALLaR is tested against state-of-the-art SDP solvers. The experiments are performed on a 2019 Macbook Pro with an 8-core CPU and 32 GB of memory. The methods are tested on SDPs arising from the following applications: maximum stable set, phase retrieval, and matrix completion.

This section is organized into five subsections. The first subsection provides details on the implementation of HALLaR. The second subsection explains the SDP solvers considered in the experiments. The remaining subsections describe the results of the computational experiments in each of the applications.

4.1   Implementation details

Our implementation of HALLaR uses the Julia programming language. The implementation applies to a class of SDPs slightly more general than (P). Let 𝔽{,}{\mathbb{F}}\in\{{\mathbb{R}},{\mathbb{C}}\} be either the field of real or complex numbers. Let 𝕊n()\mathbb{S}^{n}({\mathbb{R}}) (resp. 𝕊n()\mathbb{S}^{n}({\mathbb{C}})) be the space of n×nn\times n symmetric (resp. complex Hermitian) matrices, with Frobenius inner product \bullet and with positive semidefinite partial order \succeq. The implementation applies to SDPs of the form

minX{CX:𝒜X=b,trXτ,X0,X𝕊n(𝔽)}\displaystyle\min_{X}\quad\{C\bullet X\quad:\quad\mathcal{A}X=b,\quad\operatorname{tr}X\leq\tau,\quad X\succeq 0,\quad X\in\mathbb{S}^{n}({\mathbb{F}})\}

where b ∈ ℝ^m, C ∈ 𝕊ⁿ(𝔽), 𝒜: 𝕊ⁿ(𝔽) → ℝ^m is a linear map, and 𝒜*: ℝ^m → 𝕊ⁿ(𝔽) is its adjoint. In contrast to (P), the trace of X is bounded by τ instead of one. The inputs for our implementation are the initial points U₀, p₀, the tolerances ϵ_c, ϵ_p, and the data describing the SDP instance, which is explained below. In the experiments, the primal initial point U₀ is a random n×1 matrix with entries generated independently from the Gaussian distribution over 𝔽, and the dual initial point p₀ is the zero vector. Our computational results below are based on a variant of HALLaR, also referred to as HALLaR in this section, which differs slightly from the one described in Section 3 in that the penalty parameter β and the tolerance ϵ̄ for the AL subproblem are chosen in an adaptive manner based on some of the ideas of the LANCELOT method [16].
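As an illustration of this adaptive scheme, the following Julia sketch shows one LANCELOT-style rule; the constants 0.9, 0.1, and 10, the function name, and its signature are hypothetical choices for exposition, not the exact rule used in our code.

```julia
# Hypothetical LANCELOT-style update of the penalty beta and the AL-subproblem
# tolerance epsbar, driven by the progress of the feasibility residual.
function update_penalty(beta, epsbar, feas, feas_prev)
    if feas <= 0.9 * feas_prev       # enough progress on ||A(X) - b||:
        return beta, 0.1 * epsbar    # keep beta, tighten the subproblem tolerance
    else                             # feasibility stalled:
        return 10.0 * beta, epsbar   # increase the penalty, keep the tolerance
    end
end
```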

The data defining the SDP instance involves matrices of size n×nn\times n which should not be stored as dense arrays in large scale settings. Instead of storing a matrix M𝕊n(𝔽)M\in\mathbb{S}^{n}({\mathbb{F}}), it is assumed that a routine that evaluates the linear operator 𝔏(M):𝔽n𝔽n\mathfrak{L}(M):{\mathbb{F}}^{n}\to{\mathbb{F}}^{n}, vMvv\mapsto Mv is given by the user.

Similar to the Sketchy-CGAL method of [64], our implementation of HALLaR requires the following user inputs to describe the SDP instance:

  1. (i)

    The vector bmb\in{\mathbb{R}}^{m} and the scalar τ>0\tau>0.

  2. (ii)

    A routine for evaluating the linear operator 𝔏(C)\mathfrak{L}(C).

  3. (iii)

    A routine for evaluating linear operators of the form 𝔏(𝒜p)\mathfrak{L}(\mathcal{A}^{*}p) for any pmp\in{\mathbb{R}}^{m}.

  4. (iv)

    A routine for evaluating the quadratic function

    q𝒜:𝔽nm,y𝒜(yyT).\displaystyle q_{\mathcal{A}}:{\mathbb{F}}^{n}\to{\mathbb{R}}^{m},\quad y\mapsto\mathcal{A}(yy^{T}). (85)

Note that the routine in (iv) allows 𝒜 to be evaluated on any matrix in factorized form since 𝒜(YY^T) = Σ_i q_𝒜(y_i), where the sum is over the columns y_i of Y. In addition, the routines in (ii) and (iii) allow multiplying matrices of the form C + 𝒜*p by a matrix Y, one column y_i at a time. It follows that all steps of HALLaR (including the steps of HLR and ADAP-AIPP) can be performed by using only the above inputs. For instance, the eigenvalue computation in Step 2 of HLR is performed using iterative Krylov methods, which only require matrix-vector multiplications. A sketch of this matrix-free interface is given below.
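The following Julia sketch illustrates one way (hypothetical names, not our exact API) of packaging the inputs (i)–(iv) and performing the two operations just described in a matrix-free fashion.

```julia
# Matrix-free container for the user inputs (i)-(iv).
struct SDPData
    b::Vector{Float64}   # (i) right-hand side
    tau::Float64         # (i) trace bound
    LC::Function         # (ii)  v -> C * v
    LAadj::Function      # (iii) p -> (v -> (A* p) * v)
    qA::Function         # (iv)  y -> A(y * y')
end

# A(Y Y^T) as the sum of q_A over the columns of the factor Y.
evalA(d::SDPData, Y) = sum(d.qA(Y[:, i]) for i in axes(Y, 2))

# (C + A* p) Y, one column at a time; matrix-vector products of this form are
# all that an iterative Krylov eigensolver needs in step 2 of HLR.
function applyCAp(d::SDPData, p, Y)
    Sv = d.LAadj(p)
    return hcat((d.LC(y) .+ Sv(y) for y in eachcol(Y))...)
end
```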

4.2   Competing methods

We compare HALLaR against the following SDP solvers:

  • CSDP : Open source Julia solver based on interior point methods;

  • COSMO: Open source Julia solver based on ADMM/operator splitting;

  • SDPLR : Open source MATLAB solver based on Burer and Monteiro’s LR method;

  • SDPNAL+ : Open source MATLAB solver based on AL with a semismooth Newton method;

  • T-CGAL : Thin variant of the CGAL method [62] that stores the iterates XtX_{t} in factored form;

  • r-Sketchy : Low-rank variant of CGAL that only stores a sketch π(X_t) ∈ 𝔽^{n×r} of the iterates X_t ∈ 𝕊ⁿ(𝔽).

We use the default parameters in all methods. The rr-Sketchy method is tested with two possible values of rr, namely, r=10r=10 and r=100r=100.

Given a tolerance ϵ>0, all methods, except SDPLR, stop when a primal solution X ∈ Δⁿ and a dual solution (p,θ,S) ∈ ℝ^m × ℝ₊ × 𝕊ⁿ₊ satisfying

𝒜Xb1+bϵ,|pvaldval|1+|pval|+|dval|ϵ,C+𝒜p+θIS1+Cϵ,\displaystyle\frac{\|\mathcal{A}X-b\|}{1+\|b\|}\leq\epsilon,\qquad\frac{|\mathrm{pval}-\mathrm{dval}|}{1+|\mathrm{pval}|+|\mathrm{dval}|}\leq\epsilon,\qquad\frac{\|C+\mathcal{A}^{*}p+\theta I-S\|}{1+\|C\|}\leq\epsilon, (86)

is generated, where pval and dval denote the primal and dual values of the solution, respectively. SDPLR terminates solely based on primal feasibility, i.e., it terminates when the first condition in (86) is satisfied. The above conditions are standard in the SDP literature, although some solvers use the ℓ_∞ norm instead of the Euclidean norm. Given a vector p ∈ ℝ^m, HALLaR, SDPLR, r-Sketchy, and T-CGAL set θ := max{−λ_min(C+𝒜*p), 0} and S := C + 𝒜*p + θI. The definition of θ implies that S ⪰ 0 and that the left-hand side of the last inequality in (86) is zero. A sketch of this stopping test is given below.
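A minimal sketch of the test, assuming a Lanczos-type matrix-free eigensolver (here KrylovKit's eigsolve; the function name and signature below are ours):

```julia
using LinearAlgebra, KrylovKit

# Stopping test (86): CAp is the operator v -> (C + A*p)v, resid = A(X) - b,
# and pval/dval are the primal and dual values. With theta chosen as in the
# text, the third residual in (86) vanishes, so only two errors are checked.
function stop_test(resid, b, pval, dval, CAp, n, eps)
    lam, _, _ = eigsolve(CAp, randn(n), 1, :SR; issymmetric = true)
    theta = max(-lam[1], 0.0)             # makes S = C + A*p + theta*I PSD
    err_p = norm(resid) / (1 + norm(b))   # relative primal feasibility
    err_g = abs(pval - dval) / (1 + abs(pval) + abs(dval))  # relative gap
    return max(err_p, err_g) <= eps
end
```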

Recall that a description of r-Sketchy is already given in the paragraph preceding the part titled "Structure of the paper" in the Introduction. We now comment on how this method keeps track of its iterates and how it terminates. First, it never stores the iterate U_t but only a rank-r approximation Ũ_t of it, as already mentioned in the aforementioned paragraph of the Introduction. (We refer to U_t as the implicit iterate, as it is never computed, and to Ũ_t as the computed one.) Second, it keeps track of the quantities C•(U_tU_t^T) and 𝒜(U_tU_t^T) which, as can be easily verified, allow the first two relative errors in (86) with X = U_tU_t^T to be easily evaluated; a sketch of this bookkeeping is given below. Third, we kept the termination criterion of the r-Sketchy code intact in that it still stops when all three errors in (86) computed with the implicit solution U_t, i.e., with X = U_tU_t^T, are less than or equal to ε. The fact that r-Sketchy terminates based on the implicit solution does not guarantee, and even makes it unlikely, that it would terminate based on the computed solution. Fourth, if r-Sketchy does not terminate within the time limit specified for each problem class, the maximum of the three errors in (86) at its final computed solution Ũ_t, i.e., with X = Ũ_tŨ_t^T, is reported.
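The bookkeeping in the second remark can be sketched as follows (CGAL-style rank-one updates; the names below are ours, not the solver's API):

```julia
using LinearAlgebra

# Under an update X <- (1 - eta) * X + eta * v * v', the tracked quantities
# C • X and A(X) evolve linearly, so U_t itself is never needed.
function track(cX, aX, eta, v, LC, qA)
    cX = (1 - eta) * cX + eta * dot(v, LC(v))   # new C • X
    aX = (1 - eta) .* aX .+ eta .* qA(v)        # new A(X)
    return cX, aX
end
```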

The solver COSMO includes an optional chordal decomposition pre-processing step (thoroughly discussed in [57, 24, 26]), which has not been invoked in the computational experiments reported below. This ensures that all solvers are compared over the same set of SDP instances.

The tables in the following three subsections present the results of the computational experiments performed on large collections of maximum stable set, phase retrieval, and matrix completion SDP instances. A relative tolerance ϵ = 10^{-5} is set and a time limit of either 10000 seconds (≈ 3 hours) or 14400 seconds (= 4 hours) is given. An entry of a table marked with */N (resp., **) means that the corresponding method finds an approximate solution with relative accuracy strictly larger than the desired accuracy 10^{-5} (resp., crashes), in which case N expresses the maximum of the three final relative accuracies in (86). For r-Sketchy, entries marked as */N mean that it did not terminate within the time limit and that the maximum of the three final relative accuracies in (86) of its final computed solution Ũ_t was N. The bold numbers in the tables indicate the algorithm that had the best runtime for that instance.

4.3   Maximum stable set

Problem Instance Runtime (seconds)
Graph(nn; |E||E|) HALLaR T-CGAL 10-Sketchy 100-Sketchy CSDP COSMO SDPNAL+
G1(800; 19,176) 218.23 */.13e-02 */.31e-01 */.31e-01 226.01 322.91 60.30
G10(800; 19,176) 241.51 */.20e-02 */.78e-01 */.31e-01 220.44 229.22 55.30
G11(800; 1,600) 3.01 1631.45 220.59 234.76 3.74 118.66 73.50
G12(800; 1,600) 3.06 414.27 72.56 57.03 3.12 9531.99 70.20
G14(800; 4,694) 66.46 */.27e-03 6642.00 7231.30 9.31 3755.36 115.30
G20(800; 4,672) 439.37 */.37e-03 */.12e-01 */.12e-01 11.42 */.69e-04 341.20
G43(1000; 9,990) 96.79 */.25e-04 */.51e-01 */.26e-01 42.39 */.10e-02 62.10
G51(1,000; 5,909) 190.98 */.23e-03 */.98e-02 */.99e-02 16.97 */.33e-04 284.40
G23(2,000; 19,990) 390.34 */.64e-03 */.86e-01 */.19e-01 288.77 5739.78 503.80
G31(2,000; 19,990) 357.75 */.68e-03 */.20e-01 */.19e-01 290.72 5946.84 498.30
G32(2,000; 4,000) 3.62 3732.70 329.17 349.73 29.04 */.58e-03 853.90
G34(2,000; 4,000) 3.52 1705.60 162.85 177.84 29.04 458.11 1101.80
G35(2,000; 11,778) 730.54 */.78e-03 */.62e-02 */.62e-02 730.54 120.60 2396.60
G41(2,000; 11,785) 555.02 */.17e-02 */.59e-02 */.59e-02 114.73 */.37e-03 2027.20
G48(3,000; 6,000) 3.49 4069.30 288.64 306.91 81.97 1840.91 6347.50
G55(5,000; 12,498) 253.22 */.33e-02 */.80e-02 */.79e-02 535.57 */.11e-02 */.17e-01
G56(5,000; 12,498) 264.46 */.38e-02 */.80e-02 */.79e-02 523.06 */.27e-02 */.20e00
G57(5,000; 10,000) 4.14 7348.50 791.75 831.06 336.93 8951.40 */.10e00
G58(5,000; 29,570) 2539.83 */.96e-02 */.24e-01 */.43e-02 2177.03 */.20e-03 */.43e-01
G59(5,000; 29,570) 2625.47 */.54e-02 */.24e-01 */.43e-02 2178.33 */.37e+02 */.65e-01
G60(7,000; 17,148) 476.65 */.21e-02 */.13e-01 */.67e-02 2216.50 */.39e+02 */.10e+01
G62(7,000; 14,000) 5.00 */.28e-03 1795.50 1474.40 1463.18 */.12e-03 */.10e+01
G64(7,000; 41,459) 3901.68 */.16e-01 */.36e-02 */.36e-02 7127.72 */.99e+01 */.10e+01
G66(9,000; 18,000) 5.77 */.82e-01 2788.70 3022.70 2076.17 */.77e-03 */.10e+01
G67(10,000; 20,000) 5.87 */.13e-01 3725.80 3941.70 7599.80 ** */.47e+01
G72(10,000; 20,000) 5.92 */.11e00 3936.30 3868.60 7450.01 ** */.47e+01
G77(14,000; 28,000) 8.08 */.24e00 */.60e-02 */.60e-02 ** ** */.99e00
G81(20,000; 40,000) 10.89 */.10e00 */.91e-01 */.71e-01 ** ** */.10e+01
tor(69,192; 138,384) 40.64 */.38e00 */.63e00 */.29e00 ** ** **
Table 1: Runtimes (in seconds) for the Maximum Stable Set problem. A relative tolerance of ϵ=105\epsilon=10^{-5} is set and a time limit of 10000 seconds is given. An entry marked with /N*/N (resp., **) means that the corresponding method finds an approximate solution (resp., crashed) with relative accuracy strictly larger than the desired accuracy in which case NN expresses the maximum of the three relative accuracies in (86).

Given a graph G=([n],E)G=([n],E), the maximum stable set problem consists of finding a subset of vertices of largest cardinality such that no two vertices are connected by an edge. Lovász [40] introduced a constant, the ϑ\vartheta-function, which upper bounds the value of the maximum stable set. The ϑ\vartheta-function is the value of the SDP

max{eeTX:Xij=0,ijE,trX=1,X0,X𝕊n()}\displaystyle\max\quad\{{e}{e}^{T}\bullet X\;:\;X_{ij}=0,\;ij\in E,\;\operatorname{tr}X=1,\;X\succeq 0,\quad X\in\mathbb{S}^{n}({\mathbb{R}})\} (87)

where e=(1,1,,1)ne=(1,1,\dots,1)\in{\mathbb{R}}^{n} is the all ones vector. It was shown in [27] that the ϑ\vartheta-function agrees exactly with the stable set number for perfect graphs.
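In the min form (P), the SDP (87) corresponds to C = −ee^T, b = 0, and 𝒜(X) = (X_{ij})_{ij∈E}. A minimal Julia sketch of its matrix-free data (hypothetical names; E is an edge list on vertices 1:n):

```julia
using SparseArrays

# Matrix-free description of the theta SDP (87) in the min form (P).
function theta_sdp_data(n, E)
    b   = zeros(length(E))                # A(X) = (X_ij)_{ij in E} must vanish
    tau = 1.0                             # trace bound tr X <= 1 (tight here)
    LC  = v -> -sum(v) .* ones(n)         # C = -e e^T, so C*v = -(e'v) e
    function LAadj(p)                     # v -> (A* p) v
        S = sparse(first.(E), last.(E), p ./ 2, n, n)
        S = S + S'                        # A* p = (1/2) sum p_ij (E_ij + E_ji)
        return v -> S * v
    end
    qA = y -> [y[i] * y[j] for (i, j) in E]   # q_A(y) = A(y y^T)
    return b, tau, LC, LAadj, qA
end
```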

Problem Instance Runtime (seconds)
Graph(nn; |E||E|) HALLaR T-CGAL 10-Sketchy 100-Sketchy
H13,2H_{13,2}(8,192; 53,248) 5.04 */.23e00 1603.80 882.03
H14,2H_{14,2}(16,384; 114,688) 9.09 */.45e00 6058.60 6712.20
H15,2H_{15,2}(32,768; 245,760) 65.22 */.19e00 */.19e-01 */.14e-01
H16,2H_{16,2}(65,536; 524,288) 104.71 */.11e-01 */.24e00 */.11e-01
H17,2H_{17,2}(131,072; 1,114,112) 69.63 */.34e-01 */.72e00 */.32e-01
H18,2H_{18,2}(262,144; 2,359,296) 244.90 */.99e-02 */.88e-02 */.31e00
H19,2H_{19,2}(524,288; 4,980,736) 786.73 */.42e00 */.35e00 */.24e00
H20,2H_{20,2}(1,048,576; 10,485,760) 1157.96 */.47e00 */.31e-02 */.31e-02
Table 2: Runtimes (in seconds) for the Maximum Stable Set problem. A relative tolerance of ϵ=105\epsilon=10^{-5} is set and a time limit of 14400 seconds (4 hours) is given. An entry marked with /N*/N (resp., **) means that the corresponding method finds an approximate solution (resp., crashed) with relative accuracy strictly larger than the desired accuracy in which case NN expresses the maximum of the three relative accuracies in (86).
Problem Instance Runtime (seconds)
Graph(nn; |E||E|) HALLaR
H21,2H_{21,2}(2,097,152; 22,020,096) 2934.33
H22,2H_{22,2}(4,194,304; 46,137,344) 6264.50
H23,2H_{23,2}(8,388,608; 96,468,992) 14188.23
H24,2H_{24,2}(16,777,216; 201,326,592) 46677.82
Table 3: Runtimes (in seconds) for the Maximum Stable Set problem. A relative tolerance of ϵ=105\epsilon=10^{-5} is set.
Problem Instance Runtime (seconds)
Problem Size (n;m)(n;m) Graph Dataset HALLaR
10,937; 75,488 wing_nodal DIMACS10 1918.48
16,384; 49,122 delaunay_n14 DIMACS10 1355.01
16,386; 49,152 fe-sphere DIMACS10 147.93
22,499; 43,858 cs4 DIMACS10 747.66
25,016; 62,063 hi2010 DIMACS10 3438.06
25,181; 62,875 ri2010 DIMACS10 2077.97
32,580; 77,799 vt2010 DIMACS10 2802.37
48,837; 117,275 nh2010 DIMACS10 8530.38
24,300; 34,992 aug3d GHS_indef 8.56
32,430; 54,397 ia-email-EU Network Repo 530.21
11,806; 32,730 Oregon-2 SNAP 2787.19
11,380; 39,206 wiki-RFA_negative SNAP 1151.31
21,363; 91,286 ca-CondMat SNAP 7354.75
31,379; 65,910 as-caida_G_001 SNAP 3237.93
26,518; 65,369 p2p-Gnutella24 SNAP 344.83
22,687; 54,705 p2p-Gnutella25 SNAP 235.03
36,682; 88,328 p2p-Gnutella30 SNAP 542.07
62,586; 147,892 p2p-Gnutella31 SNAP 1918.30
49,152; 69,632 cca AG-Monien 47.24
49,152; 73,728 ccc AG-Monien 12.14
49,152; 98,304 bfly AG-Monien 13.15
16,384; 32,765 debr_G_12 AG-Monien 818.61
32,768; 65,533 debr_G_13 AG-Monien 504.29
65,536; 131,069 debr_G_14 AG-Monien 466.67
131,072; 262,141 debr_G_15 AG-Monien 488.07
262,144; 524,285 debr_G_16 AG-Monien 1266.71
524,288; 1,048,573 debr_G_17 AG-Monien 5793.57
1,048,576; 2,097,149 debr_G_18 AG-Monien 13679.12
Table 4: Runtimes (in seconds) for the Maximum stable set problem. A relative tolerance of ϵ=105\epsilon=10^{-5} is set.
Problem Instance Runtime (seconds)
Graph(nn; |E||E|) HALLaR SDPLR
H10,2H_{10,2}(1024; 5120) 2.90 1.28
H11,2H_{11,2}(2048; 11264) 3.03 10.14
H12,2H_{12,2}(4096; 24576) 3.49 56.60/.12e-03
H13,2H_{13,2}(8192; 53248) 5.04 399.89/.38e-03
H14,2H_{14,2}(16384; 114688) 9.09 2469.11/.16e-02
H15,2H_{15,2}(32768; 245760) 65.22 */.11e-01/.46e00
Table 5: Runtimes (in seconds) for the maximum stable set problem. A relative tolerance of ϵ=105\epsilon=10^{-5} is set and a time limit of 14400 seconds (4 hours) is given. An entry marked with */N1/N2 means that the corresponding method finds an approximate solution that satisfies the first relation in (86) with relative accuracy strictly larger than the desired accuracy of 10510^{-5} in which case N1N1 (resp., N2N2) expresses the final accuracy that the method satisfies the first (resp., second) relation in (86) with.

Tables 1, 2, 3, 4, and 5 present the results of the computational experiments performed on the maximum stable set SDP. Table 1 compares HALLaR against all the methods listed in Subsection 4.2 on smaller graph instances, i.e., with number of vertices not exceeding 70,000. All graph instances considered, except the last instance, are taken from the GSET data set, a curated collection of randomly generated graphs that can be found in [61]. The larger GSET graphs (GSET 66-81) are all toroidal graphs where every vertex has degree 4. The last graph instance presented in Table 1 is a large toroidal graph with approximately 70,000 vertices that we generated ourselves. A time limit of 10000 seconds (approximately 3 hours) is given. Table 2 compares HALLaR against T-CGAL, 10-Sketchy, and 100-Sketchy on large graph instances with up to 1 million vertices and 10 million edges. CSDP was not included in Table 2 since it crashed on all but one of the instances included in it. All graph instances considered in Table 2 are Hamming H(d,2) graphs, a special class of graphs with 2^d vertices and d·2^{d-1} edges. The vertex set of such graphs can be seen as corresponding to binary words of length d, and the edges correspond to pairs of binary words that differ in exactly one bit. A time limit of 14400 seconds (4 hours) is now given.

Tables 3 and 4 solely present the performance of HALLaR on extremely large-sized Hamming instances (i.e., with number of vertices exceeding 2 million) and hard graph instances from real-world datasets, respectively. The graph instances considered in Table 4 are taken from the DIMACS10, Stanford SNAP, AG-Monien, GHS_indef, and Network Repositories [39, 53, 17, 1].

Table 5 displays a special comparison between HALLaR and SDPLR on 6 different Hamming graphs. Recall that SDPLR terminates only based on the first condition in (86) and hence often finds a solution that does not satisfy the second condition in (86) with the desired accuracy of ϵ = 10^{-5}. An entry marked with time/N in Table 5 means that the corresponding method finds an approximate solution (within the time limit) that satisfies the first relation in (86) with ϵ = 10^{-5} but does not satisfy the second relation in (86) with the desired accuracy, in which case N expresses the final accuracy with which the method satisfies the second relation in (86). An entry marked with */N1/N2 means that SDPLR finds an approximate solution that satisfies the first relation in (86) with relative accuracy strictly larger than the desired accuracy of 10^{-5}, in which case N1 (resp., N2) expresses the final accuracy with which SDPLR satisfies the first (resp., second) relation in (86).

Remarks about the results presented in Tables 1, 2, 3, 4, and 5 are now given. As seen from Table 1, CSDP and HALLaR are the two best performing methods on these smaller graph instances. HALLaR, however, is the only method that can solve each of the instances to the desired accuracy of 10^{-5} within the time limit of approximately 3 hours. On graph instances where the number of vertices exceeds 14,000 (resp., 10,000), CSDP (resp., COSMO) cannot perform a single iteration within 3 hours or crashes. SDPNAL+ crashed with a lack-of-memory error on the last graph instance, which has 69,192 vertices. Although T-CGAL, 10-Sketchy, and 100-Sketchy do not perform especially well on the smaller graph instances considered in Table 1, they are included for comparison on the larger graph instances considered in Table 2 since they require considerably less memory than CSDP, COSMO, and SDPNAL+. The results presented in Table 2 demonstrate that HALLaR performs especially well on larger instances as it is the only method that can solve all instances within the time limit of 4 hours. T-CGAL, 10-Sketchy, and 100-Sketchy cannot find a solution with the desired accuracy of 10^{-5} on most of the instances considered, often finding solutions with accuracies in the range of 10^{0} to 10^{-2}. CSDP was tested on the problems considered in Table 2 but not included for comparison since it crashed on every instance except one. COSMO and SDPNAL+ are not included for comparison due to their high memory requirements.

Tables 3 and 4 show that HALLaR can solve extremely large Hamming instances and hard real-world instances, respectively, within a couple of hours. As seen from Table 3, HALLaR can solve a Hamming instance with 4 million vertices and 40 million edges (resp., 16 million vertices and 200 million edges) in under 2 hours (resp., 13 hours). Table 4 shows that HALLaR can solve a huge de Bruijn graph instance (which arises in the context of genome assembly) in just a few hours.

The results presented in Table 5 display the superior performance of HALLaR compared to SDPLR on six different Hamming graphs. HALLaR not only finds more accurate solutions than SDPLR within the time limit of 4 hours but is also at least 80 times faster than SDPLR on the three largest instances.

4.4   Phase retrieval

Given mm pairs {(ai,bi)}i=1mn×+\{(a_{i},b_{i})\}_{i=1}^{m}\subseteq{\mathbb{C}}^{n}\times{\mathbb{R}}_{+}, consider the problem of finding a vector xnx\in\mathbb{C}^{n} such that

|ai,x|2=bi,i=1,,m.\displaystyle|\langle a_{i},x\rangle|^{2}=b_{i},\quad i=1,\dots,m.

In other words, the goal is to retrieve xx from the magnitude of mm linear measurements. By creating the complex Hermitian matrix X=xxHX=xx^{H}, this problem can be approached by solving the complex-valued SDP relaxation

minX{tr(X):aiaiH,X=bi,X0,X𝕊n()}.\min_{X}\quad\left\{\operatorname{tr}(X)\quad:\quad\langle a_{i}a_{i}^{H},X\rangle=b_{i},\quad X\succeq 0,\quad X\in\mathbb{S}^{n}({\mathbb{C}})\right\}.

The motivation for the trace objective function is that it promotes low-rank solutions. It was shown in [12] that the relaxation is tight (i.e., the vector x can be retrieved from the SDP solution X) when the vectors a_i are sampled independently and uniformly on the unit sphere. Notice that this class of SDPs does not have a trace bound. However, since the objective function is precisely the trace, any bound on the optimal value can be used as the trace bound. In particular, the squared norm of the vector x is a valid trace bound. Even though x is unknown, bounds on its norm are known (see for example [63]).

Computational experiments are performed on the synthetic data set from [64] that is based on the coded diffraction pattern model from [11]. Given n, the hidden solution vector x ∈ ℂⁿ is generated from the complex standard normal distribution. There are m = 12n measurements, indexed by pairs (j,l) ∈ [12]×[n]. Consider vectors y_j ∈ ℂⁿ for j ∈ [12], where the entries of y_j are products of two independent random variables: the first is uniform on {1,i,−1,−i}, and the second chooses from {√2/2, √3} with probabilities 4/5 and 1/5. The linear measurements correspond to modulating the vector x with each of the y_j's and then taking a discrete Fourier transform:

⟨a_{j,l}, x⟩ := DFT(y_j ∘ x)_l for j ∈ [12], l ∈ [n]

where \circ denotes the Hadamard product, and DFT()l\mathrm{DFT}(\cdot)_{l} denotes the ll-th entry of the discrete Fourier transform. The vector bb is obtained by applying the measurements to x{x}. The trace bound is set as τ=3n\tau=3n, similarly as in [64].
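Both 𝒜 and 𝒜* for this model reduce to FFTs, as the following Julia sketch indicates (our names, not the solvers' code; Y is the n×12 matrix of masks y_j, P is the multiplier p reshaped to n×12, and F^H = n·ifft for the unnormalized DFT F):

```julia
using FFTW

# q_A(x) = A(x x^H): entries |DFT(y_j ∘ x)_l|^2, one FFT per mask, stacked over j.
qA(Y, x) = vec(abs2.(fft(Y .* x, 1)))

# Adjoint: (A* p) x = sum_j conj(y_j) ∘ F^H( p_j ∘ F(y_j ∘ x) ).
Aadjx(Y, P, x) = vec(sum(conj.(Y) .* (length(x) .* ifft(P .* fft(Y .* x, 1), 1)), dims = 2))
```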

Tables 6 and 7 present the results of the computational experiments performed on the phase retrieval SDP. As mentioned in the above paragraph, all instances considered are taken from a synthetic dataset that can be found in [64]. Table 6 compares HALLaR against T-CGAL, 1010-Sketchy, and 100-Sketchy on medium sized phase retrieval instances, i.e., the dimension nn is either 10000 or 31623. The ranks of the outputted solutions of HALLaR and T-CGAL are now also reported. For entries corresponding to HALLaR and T-CGAL, the number reported after the last forward slash indicates the rank of that corresponding method’s outputted solution. A time limit of 14400 seconds (4 hours) is given. Table 7 solely presents the performance of HALLaR on larger sized phase retrieval instances, i.e., with dimension nn greater than or equal to 100,000. The rank of the outputted solution of HALLaR is again reported.

Problem Instance Runtime (seconds)
Problem Size (n;m)(n;m) HALLaR T-CGAL 10-Sketchy 100-Sketchy
10,000; 120,000 69.11/2 */.18e-01/561 */.13e-01 */.25e-01
10,000; 120,000 66.14/2 */.54e-01/521 11112.00 */.80e-01
10,000; 120,000 64.42/2 */.12e00/224 */.31e-01 */.13e00
10,000; 120,000 99.98/2 */.28e-01/201 */.13e00 */.26e-01
31,623; 379,476 620.82/3 */.29e00/1432 */.77e-01 */.23e00
31,623; 379,476 982.34/2 */.23e00/729 */.63e-01 */.93e00
31,623; 379,476 870.25/2 */.66e00/794 */.65e-02 */.78e-01
31,623; 379,476 712.09/2 */.10e+01/1280 */.10e+01 */.82e00
Table 6: Runtimes (in seconds) for the Phase Retrieval problem. A relative tolerance of ϵ = 10^{-5} is set and a time limit of 14400 seconds (4 hours) is given. An entry marked with */N means that the corresponding method finds an approximate solution with relative accuracy strictly larger than the desired accuracy, in which case N expresses the maximum of the three relative accuracies in (86). For entries corresponding to HALLaR and T-CGAL, the number reported after the last forward slash indicates the rank of the corresponding method's outputted solution.
Problem Instance Runtime (seconds)
Problem Size (n;m)(n;m) HALLaR
100,000; 1,200,000 1042.92/4
100,000; 1,200,000 1147.46/3
100,000; 1,200,000 929.67/5
100,000; 1,200,000 939.23/5
316,228; 3,794,736 8426.94/5
316,228; 3,794,736 2684.83/1
316,228; 3,794,736 7117.31/6
316,228; 3,794,736 7489.42/7
3,162,278; 37,947,336 40569.10/1
Table 7: Runtimes (in seconds) for the Phase Retrieval problem. The number after the forward slash indicates the rank of HALLaR’s outputted solution. A relative tolerance of ϵ=105\epsilon=10^{-5} is set.

Table 6 only compares HALLaR against T-CGAL and Sketchy-CGAL since these are the only methods that take advantage of the fact that the linear maps 𝒜 and 𝒜* in the phase retrieval SDP can be evaluated efficiently using the fast Fourier transform (FFT). As seen from Table 6, HALLaR is the best performing method and the only method that can solve each instance to a relative accuracy of 10^{-5} within the time limit of 4 hours. T-CGAL and Sketchy-CGAL were unable to solve most instances to the desired accuracy, often finding solutions with accuracies in the range of 10^{0} to 10^{-2} in 4 hours. Sketchy-CGAL was also over 150 times slower than HALLaR on the single instance on which it was able to find a 10^{-5}-accurate solution.

Since T-CGAL and Sketchy-CGAL did not perform well on the medium sized phase retrieval instances considered in Table 6, computational results for large sized phase retrieval instances are only presented for HALLaR in Table 7. The results presented in Table 7 show that HALLaR solves a phase retrieval SDP instance with dimension pair (n,m)(105,106)(n,m)\approx(10^{5},10^{6}) in approximately 1515 minutes and also one with dimension pair (n,m)(106,107)(n,m)\approx(10^{6},10^{7}) in just 1111 hours.

4.5   Matrix completion

Consider the problem of retrieving a low rank matrix Mn1×n2M\in{\mathbb{R}}^{n_{1}\times n_{2}}, where n1n2n_{1}\leq n_{2}, by observing a subset of its entries: MijM_{ij}, ijΩij\in\Omega. A standard approach to tackle this problem is by considering the nuclear norm relaxation:

minY{Y:Yij=Mij,ijΩ,Yn1×n2}\min_{Y}\quad\left\{\|Y\|_{*}\quad:\quad Y_{ij}=M_{ij},\ \forall\,ij\in\Omega,\quad Y\in{\mathbb{R}}^{n_{1}\times n_{2}}\right\}

The above problem can be rephrased as the following SDP:

minX{12tr(X):X=(W1YYTW2)0,Yi,j=Mi,jijΩ,X𝕊n1+n2()}\displaystyle\min_{X}\quad\Bigl{\{}\,\frac{1}{2}\operatorname{tr}(X)\quad:\quad X=\begin{pmatrix}W_{1}&Y\\ Y^{T}&W_{2}\end{pmatrix}\succeq 0,\quad Y_{i,j}=M_{i,j}\;\forall\,ij\in\Omega,\quad X\in\mathbb{S}^{n_{1}+n_{2}}({\mathbb{R}})\Bigr{\}} (88)
Problem Instance Runtime (seconds)
Problem Size (n;m)(n;m) rr HALLaR 10-Sketchy
10,000; 828,931 3 321.81 */.81e00/.89e-02
10,000; 828,931 3 332.54 */.80e00/.82e-02
10,000; 2,302,586 5 1117.60 */.92e00/.28e00
10,000; 2,302,586 5 1067.15 */.11e+01/ .41e00
31,623; 2,948,996 3 1681.03 */.81e00/.69e-02
31,623; 2,948,996 3 1362.22 */.81e00/.82e-02
31,623; 8,191,654 5 4740.48 */.90e00/.43e-01
31,623; 8,191,654 5 5238.57 */.90e00/.84e-01
Table 8: Runtimes (in seconds) for the Matrix Completion problem. A relative tolerance of ϵ=105\epsilon=10^{-5} is set and a time limit of 14400 seconds (4 hours) is given. An entry marked with /N1/N2*/N_{1}/N_{2} means that the implicit solution corresponding to 10-Sketchy had relative accuracy strictly larger than the desired accuracy in which case N1N_{1} (resp. N2N_{2}) expresses the maximum of the three relative accuracies in (86) of its computed (resp. implicit) solution.
Problem Instance Runtime (seconds)
Problem Size (n;m)(n;m) rr HALLaR
75,000; 3,367,574 2 3279.85
75,000; 7,577,040 3 5083.68
100,000; 4,605,171 2 2872.44
100,000; 10,361,633 3 6048.63
150,000; 7,151,035 2 10967.74
150,000; 16,089,828 3 14908.08
200,000; 9,764,859 2 13454.12
200,000; 21,970,931 3 28021.56
Table 9: Runtimes (in seconds) for the Matrix Completion problem. A relative tolerance of ϵ=105\epsilon=10^{-5} is set.

The nuclear norm relaxation was introduced in [22]. It was shown in [10] that the relaxation provably completes the matrix when m = |Ω| is sufficiently large and the indices of the observations are independent and uniform.

Similar to the SDP formulation of phase retrieval in subsection 4.4, the SDP formulation of matrix completion does not include a trace bound, but the objective function is a multiple of the trace. Hence, any bound on the optimal value leads to a trace bound. In particular, a valid trace bound is 2Y02\|Y_{0}\|_{*}, where Y0n1×n2Y_{0}\in{\mathbb{R}}^{n_{1}\times n_{2}} is the trivial completion, which agrees with MijM_{ij} in the observed entries and has zeros everywhere else. However, computing the nuclear norm of Y0Y_{0} is expensive, as it requires an SVD decomposition. In the experiments the inexpensive, though weaker, bound τ=2n1Y0F\tau=2\sqrt{n_{1}}\|Y_{0}\|_{F} is used instead.

The matrix completion instances are generated randomly, using the following procedure. Given rn1n2r\leq n_{1}\leq n_{2}, the hidden solution matrix MM is the product UVTUV^{T}, where the matrices Un1×rU\in{\mathbb{R}}^{n_{1}\times r} and Vn2×rV\in{\mathbb{R}}^{n_{2}\times r} have independent standard Gaussian random variables as entries. Afterwards, mm independent and uniformly random observations from MM are taken. The number of observations is m=γr(n1+n2r)m=\lceil\gamma\ r(n_{1}+n_{2}-r)\rceil where γ=rlog(n1+n2)\gamma=r\log(n_{1}+n_{2}) is the oversampling ratio.
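A minimal Julia sketch of this generator (sampling with replacement for simplicity; the names are ours), together with the map q_𝒜 for the SDP (88):

```julia
using LinearAlgebra

# Random matrix completion instance: hidden rank-r solution, m observations,
# trace bound tau = 2*sqrt(n1)*||Y0||_F, and q_A for the SDP (88).
function mc_instance(n1, n2, r)
    U, V = randn(n1, r), randn(n2, r)
    M = U * V'                                   # hidden low-rank matrix
    gamma = r * log(n1 + n2)                     # oversampling ratio
    m = ceil(Int, gamma * r * (n1 + n2 - r))     # number of observations
    Omega = [(rand(1:n1), rand(1:n2)) for _ in 1:m]
    b = [M[i, j] for (i, j) in Omega]            # observed entries
    tau = 2 * sqrt(n1) * norm(b)                 # ||Y0||_F = norm of observations
    # with y = [u; v] in R^(n1+n2), the Y block of y*y' is u*v', hence:
    qA = y -> [y[i] * y[n1 + j] for (i, j) in Omega]
    return Omega, b, tau, qA
end
```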

Tables 8 and 9 present the results of the computational experiments performed on the matrix completion SDP. All instances are generated randomly using the procedure described in the previous paragraph. Table 8 compares HALLaR against 1010-Sketchy on medium sized matrix completion instances, i.e., the dimension n=n1+n2n=n_{1}+n_{2} is either 10000 or 31623. A time limit of 14400 seconds (4 hours) is given. On instances where 1010-Sketchy did not terminate within the time limit, the relative accuracy of both of its computed and implicit solutions are now reported. An entry marked with /N1/N2*/N_{1}/N_{2} means that, within 4 hours, the implicit solution corresponding to 10-Sketchy had relative accuracy strictly larger than the desired accuracy in which case N1N_{1} (resp. N2N_{2}) expresses the maximum of the three relative accuracies in (86) of its computed (resp. implicit) solution. Table 9 solely presents the performance of HALLaR on larger sized matrix completion instances, i.e., with dimension nn greater than or equal to 75000.

Table 8 only compares HALLaR against 10-Sketchy due to 10-Sketchy’s low memory requirements and its superior/comparable performance to 100-Sketchy and T-CGAL on previous problem classes. As seen from Table 8, HALLaR is the best performing method and the only method that can solve each instance to a relative accuracy of 10510^{-5} within the time limit of 4 hours. 10-Sketchy is unable to solve a single instance to the desired accuracy, often finding solutions with accuracies on the range of 10010^{0} to 10210^{-2} in 4 hours.

Since 10-Sketchy did not perform well on the medium sized matrix completion instances considered in Table 8, computational results for large sized matrix completion instances are only presented for HALLaR in Table 9. The results presented in Table 9 show that HALLaR solves a matrix completion instance with dimension pair (n,m)(105,106)(n,m)\approx(10^{5},10^{6}) in approximately 4848 minutes and also one with dimension pair (n,m)(105,107)(n,m)\approx(10^{5},10^{7}) in just 1.71.7 hours.

Appendix A Technical Results

This section states some useful facts about the spectraplex Δⁿ defined in (1). The first result characterizes the optimal solutions of minimizing a given linear form over the set Δⁿ. The second result characterizes its ϵ-normal cone.

A.1   Characterization of Optimal Solution of Linear Form over Spectraplex

Consider the problem

minU{GU:UΔn},\min_{U}\{G\bullet U:U\in\Delta^{n}\}, (89)

where Δn\Delta^{n} is as in (1). The optimality condition for (89) implies that ZZ is an optimal solution of (89) if and only if there exists θ\theta\in\mathbb{R} such that the pair (Z,θ)(Z,\theta) satisfies

G+θI0,(G+θI)Z=0,θ0,θ(tr(Z)1)=0.G+\theta I\succeq 0,\quad(G+\theta I)\bullet Z=0,\quad\theta\geq 0,\quad\theta(\operatorname{tr}(Z)-1)=0. (90)

The following proposition explicitly characterizes solutions of the above problem using the special structure of Δn\Delta^{n}.

Proposition A.1.

Let (λmin,vmin)(\lambda_{\min},v_{\min}) be a minimum eigenpair of GG and define

θF=max{λmin(G),0},ZF={vminvminT if θF>0,0 otherwise.\theta^{F}=\max\{-{\lambda}_{\min}(G),0\},\quad Z^{F}=\begin{cases}v_{\min}v_{\min}^{T}&\text{ if $\theta^{F}>0$,}\\ 0&\text{ otherwise}.\end{cases} (91)

Then the pair (Z,θ)=(ZF,θF)(Z,\theta)=(Z^{F},\theta^{F}) satisfies (90). As a consequence, ZFZ^{F} is an optimal solution of (89).

Proof.

Consider the pair (ZF,θF)(Z^{F},\theta^{F}) as defined in (91). It is immediate from the definition of θF\theta^{F} that θF0\theta^{F}\geq 0. Consider now two cases. For the first case, suppose that θF=0\theta^{F}=0 and hence ZF=0Z^{F}=0. It then follows immediately that the pair (ZF,θF)(Z^{F},\theta^{F}) satisfies the optimality conditions in (90).

For the second case, suppose that θ^F > 0. Thus θ^F = −λ_min(G) and Z^F = v_min v_min^T. Clearly, then, G + θ^F I ⪰ 0 and tr(Z^F) = 1, and hence the pair (Z^F, θ^F) satisfies the first and fourth relations in (90). The fact that (λ_min, v_min) is an eigenpair of G implies that G • Z^F = λ_min = −θ^F, and hence the pair (Z^F, θ^F) also satisfies the second relation in (90). ∎
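Computationally, Proposition A.1 reduces linear optimization over Δⁿ to a single minimum-eigenpair computation, as in the following Julia sketch (our names; any Lanczos-type solver such as KrylovKit's eigsolve works, given only the operator v ↦ Gv):

```julia
using KrylovKit

# Minimize G • U over the spectraplex via Proposition A.1.
function spectraplex_argmin(LG, n)
    vals, vecs, _ = eigsolve(LG, randn(n), 1, :SR; issymmetric = true)
    lam, v = vals[1], vecs[1]
    theta = max(-lam, 0.0)                   # theta^F in (91)
    Z = theta > 0 ? v * v' : zeros(n, n)     # Z^F; in practice only v is stored
    return Z, theta
end
```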

A.2   Characterization of ϵ\epsilon-Normal Cone of Spectraplex

Proposition A.2.

Let ZΔnZ\in\Delta^{n} and θ=θF\theta=\theta^{F} where θF\theta^{F} is as in (91). Then:

  • a)

    there hold

    θ0,G+θI0;\theta\geq 0,\quad G+\theta I\succeq 0; (92)
  • b)

    for any ϵ>0\epsilon>0, the inclusion holds

    0G+NΔnϵ(Z)0\in G+N^{\epsilon}_{\Delta^{n}}(Z) (93)

    if and only if

    GZ+θϵ.G\bullet Z+\theta\leq\epsilon. (94)
Proof.

(a) It follows immediately from the definition of θF\theta^{F} in (91) and the fact that θ=θF\theta=\theta^{F} that the two relations in (92) hold.

(b) Proposition A.1 implies that the pair (ZF,θF)(Z^{F},\theta^{F}) satisfies the relation GZF=θFG\bullet Z^{F}=-\theta^{F} where (ZF,θF)(Z^{F},\theta^{F}) is as in (91). It then follows from this relation, the fact that θ=θF\theta=\theta^{F}, and the inclusions GNΔnϵ(Z)-G\in N^{\epsilon}_{\Delta^{n}}(Z) and ZFΔnZ^{F}\in\Delta^{n} that

ϵG(ZFZ)=GZ+θ.\displaystyle\epsilon\geq-G\bullet(Z^{F}-Z)=G\bullet Z+\theta.

For the other direction, suppose the pair (Z,θ)(Z,\theta) satisfies (94) and let UΔnU\in\Delta^{n}. It then follows that

G(UZ)\displaystyle-G\bullet(U-Z) =GZ(G+θI)U+θtr(U)\displaystyle=G\bullet Z-(G+\theta I)\bullet U+\theta\operatorname{tr}(U)
GZ+θ(G+θI)U\displaystyle\leq G\bullet Z+\theta-(G+\theta I)\bullet U
ϵ0=ϵ.\displaystyle\leq\epsilon-0=\epsilon.

Hence, GNΔnϵ(Z)-G\in N^{\epsilon}_{\Delta^{n}}(Z), proving the result. ∎

Appendix B ADAP-FISTA Method

Let 𝔼 denote a finite-dimensional inner product real vector space with inner product and induced norm denoted by ⟨·,·⟩ and ‖·‖, respectively. Also, let ψ_n : 𝔼 → (−∞,∞] be a proper closed convex function whose domain 𝒩 := dom ψ_n ⊆ 𝔼 has finite diameter D_{ψ_n}.

ADAP-FISTA considers the following problem

min{ψ(u):=ψs(u)+ψn(u):u𝔼}\min\{\psi(u):=\psi_{s}(u)+\psi_{n}(u):u\in\mathbb{E}\} (95)

where ψs\psi_{s} is assumed to satisfy the following assumption:

  • (B1)

    ψs\psi_{s} is a real-valued function that is differentiable on 𝔼\mathbb{E} and there exists L¯0\bar{L}\geq 0 such that

    ψs(u)ψs(u)L¯uuu,uB~,\|\nabla\psi_{s}(u^{\prime})-\nabla\psi_{s}(u)\|\leq\bar{L}\|u^{\prime}-u\|\quad\forall u,u^{\prime}\in\tilde{B}, (96)

    where

    B~:=𝒩+B¯Dψn\tilde{B}:=\mathcal{N}+\bar{B}_{D_{\psi_{n}}} (97)

and B̄_l := {u : ‖u‖ ≤ l} is the closed ball centered at 0 with radius l.

We now describe the type of approximate solution that ADAP-FISTA aims to find.

Problem: Given ψ satisfying the above assumptions, a point x₀ ∈ 𝒩, and a parameter σ ∈ (0,∞), the problem is to find a pair (y,r) ∈ 𝒩×𝔼 such that

rσyx0,rψs(y)+ψn(y).\displaystyle\|r\|\leq\sigma\|y-x_{0}\|,\quad r\in\nabla\psi_{s}(y)+\partial\psi_{n}(y). (98)

We are now ready to present the ADAP-FISTA algorithm below.

 

ADAP-FISTA Method

 

Universal Parameters: σ>0\sigma>0 and χ(0,1)\chi\in(0,1).

Input: initial point x0𝒩x_{0}\in\mathcal{N}, scalars μ>0\mu>0, L0>μL_{0}>\mu, and function pair (ψs,ψn)(\psi_{s},\psi_{n}).

  • 0.

    set y0=x0y_{0}=x_{0}, A0=0A_{0}=0, τ0=1\tau_{0}=1, and i=0i=0;

  • 1.

    Set Li+1=LiL_{i+1}=L_{i};

  • 2.

    Compute

    ai=τi+τi2+4τiAi(Li+1μ)2(Li+1μ),x~i=Aiyi+aixiAi+ai,a_{i}=\frac{\tau_{i}+\sqrt{\tau_{i}^{2}+4\tau_{i}A_{i}(L_{i+1}-\mu)}}{2(L_{i+1}-\mu)},\quad\tilde{x}_{i}=\frac{A_{i}y_{i}+a_{i}x_{i}}{A_{i}+a_{i}}, (99)
    yi+1:=argminu𝒩{qi(u;x~i,Li+1):=ψs(u;x~i)+ψn(u)+Li+12ux~i2},y_{i+1}:=\underset{u\in\mathcal{N}}{\operatorname*{arg\,min}}\left\{q_{i}(u;\tilde{x}_{i},L_{i+1}):=\ell_{\psi_{s}}(u;\tilde{x}_{i})+\psi_{n}(u)+\frac{L_{i+1}}{2}\|u-\tilde{x}_{i}\|^{2}\right\}, (100)

    If the inequality

    ψs(yi+1;x~i)+(1χ)Li+14yi+1x~i2ψs(yi+1)\ell_{\psi_{s}}(y_{i+1};\tilde{x}_{i})+\frac{(1-\chi)L_{i+1}}{4}\|y_{i+1}-\tilde{x}_{i}\|^{2}\geq\psi_{s}(y_{i+1}) (101)

    holds go to step 3; else set Li+12Li+1L_{i+1}\leftarrow 2L_{i+1} and repeat step 2;

  • 3.

    Compute

    Ai+1\displaystyle A_{i+1} =Ai+ai,τi+1=τi+aiμ,\displaystyle=A_{i}+a_{i},\quad\tau_{i+1}=\tau_{i}+a_{i}\mu, (102)
    si+1\displaystyle s_{i+1} =(Li+1μ)(x~iyi+1),\displaystyle=(L_{i+1}-\mu)(\tilde{x}_{i}-y_{i+1}), (103)
    xi+1\displaystyle\quad x_{i+1} =1τi+1[μaiyi+1+τixiaisi+1];\displaystyle=\frac{1}{\tau_{i+1}}\left[\mu a_{i}y_{i+1}+\tau_{i}x_{i}-a_{i}s_{i+1}\right]; (104)
  • 4.

    If the inequality

    yi+1x02χAi+1Li+1yi+1x~i2,\displaystyle\|y_{i+1}-x_{0}\|^{2}\geq\chi A_{i+1}L_{i+1}\|y_{i+1}-\tilde{x}_{i}\|^{2}, (105)

    holds, then go to step 5; otherwise, stop with failure;

  • 5.

    Compute

    vi+1=ψs(yi+1)ψs(x~i)+Li+1(x~iyi+1).v_{i+1}=\nabla\psi_{s}(y_{i+1})-\nabla\psi_{s}(\tilde{x}_{i})+L_{i+1}(\tilde{x}_{i}-y_{i+1}). (106)

    If the inequality

    vi+1σyi+1x0\|v_{i+1}\|\leq\sigma\|y_{i+1}-x_{0}\| (107)

    holds then stop with success and output (y,v,L):=(yi+1,vi+1,Li+1)(y,v,L):=(y_{i+1},v_{i+1},L_{i+1}); otherwise, ii+1i\leftarrow i+1 and go to step 1.

 

The ADAP-FISTA method was first developed in [56]. The method assumes that the gradient of ψ_s is Lipschitz continuous on all of 𝔼 since it requires Lipschitz continuity of the gradient at the sequences of points {x̃_i} and {y_i}. This assumption can be relaxed to the one in (B1) by showing that the sequence {x̃_i} lies in the set B̃ defined in (97). The following lemma establishes this result inductively by using several key lemmas which can be found in [56].

Lemma B.1.

Let m1m\geq 1 and suppose ADAP-FISTA generates sequence {x~i}i=0mB~\{\tilde{x}_{i}\}^{m}_{i=0}\subseteq\tilde{B}. Then, the following statements hold:

  • (a)

L₀ ≤ L_{i+1} ≤ max{L₀, 4L̄/(1−χ)} for any i ∈ {0,…,m};

  • (b)

    for any x𝒩x\in\mathcal{N}, the relation

A_{m+1}[\psi(y_{m+1})-\psi(x)]+\frac{\tau_{m+1}}{2}\|x-x_{m+1}\|^{2}\leq\frac{1}{2}\|x-x_{0}\|^{2}-\frac{\chi}{2}\sum_{i=0}^{m}A_{i+1}L_{i+1}\|y_{i+1}-\tilde{x}_{i}\|^{2} (108)

    holds;

  • (c)

    x~m+1B~\tilde{x}_{m+1}\in\tilde{B}.

As a consequence, the entire sequence {x~i}B~\{\tilde{x}_{i}\}\subseteq\tilde{B}.

Proof.

(a) Let i ∈ {0,…,m}. Clearly y_{i+1} ∈ B̃ since the definition of B̃ in (97) implies that B̃ ⊇ 𝒩. Now, using the facts that x̃_i ∈ B̃ and y_{i+1} ∈ B̃, assumption (B1) implies that ∇ψ_s is Lipschitz continuous at these points. The proof of (a) is then identical to the one of Lemma A.3(b) in [56].

(b) Clearly, since ADAP-FISTA generates sequence {x~i}i=0m\{\tilde{x}_{i}\}_{i=0}^{m}, its loop in step 2 always terminates during its first mm iterations. Hence, ADAP-FISTA also generates sequences {yi}i=0m+1\{y_{i}\}_{i=0}^{m+1} and {xi}i=0m+1\{x_{i}\}_{i=0}^{m+1}. The proof of relation (108) then follows from this observation, (a), and by using similar arguments to the ones developed in Lemmas A.6-A.10 of [56].

(c) It follows from the fact that τm+11\tau_{m+1}\geq 1, relation (108) with x=ym+1x=y_{m+1}, and the definition of DψnD_{\psi_{n}} that ym+1xm+1ym+1x0Dψn\|y_{m+1}-x_{m+1}\|\leq\|y_{m+1}-x_{0}\|\leq D_{\psi_{n}}. This relation, the fact that ym+1𝒩y_{m+1}\in\mathcal{N}, and the definition of B~\tilde{B} in (97) then imply that

x_{m+1} ∈ 𝒩 + B̄_{D_{ψ_n}} = B̃.

The result then immediately follows from the fact that x~m+1\tilde{x}_{m+1} is a convex combination of xm+1x_{m+1} and ym+1y_{m+1} and that B~\tilde{B} is a convex set.

The last statement of Lemma B.1 follows immediately from (c) and a simple induction argument. ∎

We now present the main convergence results of ADAP-FISTA, whose proofs can be found in [56]. Proposition B.2 below gives an iteration complexity bound regardless of whether ADAP-FISTA terminates with success or failure and shows that, if ADAP-FISTA stops successfully, then it obtains a stationary solution of (95) with respect to a relative error criterion. It also shows that ADAP-FISTA always stops successfully whenever ψ_s is μ-strongly convex.

Proposition B.2.

The following statements about ADAP-FISTA hold:

  • (a)

    if L0=𝒪(L¯)L_{0}={\cal O}(\bar{L}), it always stops (with either success or failure) in at most

    𝒪1(L¯μlog1+(L¯))\mathcal{O}_{1}\left(\sqrt{\frac{\bar{L}}{\mu}}\log^{+}_{1}(\bar{L})\right)

    iterations/resolvent evaluations;

  • (b)

    if it stops successfully, it terminates with a triple (y,v,L)𝒩×𝔼×++(y,v,L)\in\mathcal{N}\times\mathbb{E}\times{\mathbb{R}}_{++} satisfying

    vψs(y)+ψn(y),vσyx0,Lmax{L0,4L¯/(1χ)};\displaystyle v\in\nabla\psi_{s}(y)+\partial\psi_{n}(y),\quad\|v\|\leq\sigma\|y-x_{0}\|,\quad L\leq\max\{L_{0},4\bar{L}/(1-\chi)\}; (109)
  • (c)

    if ψs\psi_{s} is μ\mu-convex on 𝒩\mathcal{N}, then ADAP-FISTA always terminates with success and its output (y,v,L)(y,v,L), in addition to satisfying (109) also satisfies the inclusion v(ψs+ψn)(y)v\in\partial(\psi_{s}+\psi_{n})(y).

Appendix C Proof of Proposition 2.4

This section provides the proof of Proposition 2.4 stated in Subsection 2.2.

Let B¯r:={U:UFr}\bar{B}_{r}:=\{U:\|U\|_{F}\leq r\}. The following lemma establishes that the function g~()\tilde{g}(\cdot) in (15) has Lipschitz continuous gradient over B¯3\bar{B}_{3}.

Lemma C.1.

The function g~(U)\tilde{g}(U) defined in (15) is Lg~L_{\tilde{g}}-smooth on B¯3\bar{B}_{3} where Lg~L_{\tilde{g}} is as in (25).

Proof.

Let U1,U2B¯3U_{1},U_{2}\in\bar{B}_{3} be given. Adding and subtracting U1U2TU_{1}U^{T}_{2} and using the triangle inequality, we have

U1U1TU2U2TFU1U1U2F+U2U1U2F6U1U2F.\|U_{1}U_{1}^{T}-U_{2}U_{2}^{T}\|_{F}\leq\|U_{1}\|\,\|U_{1}-U_{2}\|_{F}+\|U_{2}\|\,\|U_{1}-U_{2}\|_{F}\leq 6\|U_{1}-U_{2}\|_{F}. (110)

This relation, the chain rule, the triangle inequality, the facts that U1,U2B¯3U_{1},U_{2}\in\bar{B}_{3} and gg is LgL_{g}-smooth on B¯3\bar{B}_{3}, and the definition of G¯\bar{G} in (25), then imply that

g~(U1)g~(U2)F=2g(U1U1T)U12g(U2U2T)U2F\displaystyle\|\nabla\tilde{g}(U_{1})-\nabla\tilde{g}(U_{2})\|_{F}=\|2\nabla g(U_{1}U^{T}_{1})U_{1}-2\nabla g(U_{2}U^{T}_{2})U_{2}\|_{F}
2g(U1U1T)U12g(U1U1T)U2F+2g(U1U1T)U22g(U2U2T)U2F\displaystyle\leq\|2\nabla g(U_{1}U^{T}_{1})U_{1}-2\nabla g(U_{1}U^{T}_{1})U_{2}\|_{F}+\|2\nabla g(U_{1}U^{T}_{1})U_{2}-2\nabla g(U_{2}U^{T}_{2})U_{2}\|_{F}
2g(U1U1T)FU1U2F+2U2g(U1U1T)g(U2U2T)F\displaystyle\leq 2\|\nabla g(U_{1}U^{T}_{1})\|_{F}\,\|U_{1}-U_{2}\|_{F}+2\|U_{2}\|\,\|\nabla g(U_{1}U^{T}_{1})-\nabla g(U_{2}U^{T}_{2})\|_{F}
(25)2G¯U1U2F+6g(U1U1T)g(U2U2T)F\displaystyle\overset{\eqref{phi* alpha0 quantities}}{\leq}2\bar{G}\|U_{1}-U_{2}\|_{F}+6\|\nabla g(U_{1}U^{T}_{1})-\nabla g(U_{2}U^{T}_{2})\|_{F}
(9)2G¯U1U2F+6LgU1U1TU2U2TF\displaystyle\overset{\eqref{upper curvature g}}{\leq}2\bar{G}\|U_{1}-U_{2}\|_{F}+6L_{g}\|U_{1}U^{T}_{1}-U_{2}U^{T}_{2}\|_{F}
(110)2G¯U1U2F+36LgU1U2F=(2G¯+36Lg)U1U2F.\displaystyle\overset{\eqref{MVT}}{\leq}2\bar{G}\|U_{1}-U_{2}\|_{F}+36L_{g}\|U_{1}-U_{2}\|_{F}=(2\bar{G}+36L_{g})\|U_{1}-U_{2}\|_{F}.

The conclusion of the lemma now follows from the above inequality and the definition of Lg~L_{\tilde{g}} in (25). ∎
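As a quick sanity check of Lemma C.1, the sketch below numerically compares the ratio ‖∇g̃(U1)−∇g̃(U2)‖F/‖U1−U2‖F against the constant 2Ḡ+36Lg for the illustrative choice g(X)=½‖X−B‖F² (which is 1-smooth); here Ḡ is estimated by sampling and all names are hypothetical, so this is a numerical illustration of the bound rather than part of HALLaR.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3
B = rng.standard_normal((n, n))
B = (B + B.T) / 2  # symmetric data matrix for the illustrative objective

def grad_g(X):
    # Illustrative choice g(X) = 0.5*||X - B||_F^2, so L_g = 1.
    return X - B

def grad_g_tilde(U):
    # Chain rule: gradient of g_tilde(U) = g(U U^T) is 2*grad_g(U U^T) U.
    return 2 * grad_g(U @ U.T) @ U

def sample_ball(radius):
    # Random point of the Frobenius ball of the given radius.
    U = rng.standard_normal((n, r))
    return radius * rng.random() * U / np.linalg.norm(U)

L_g = 1.0
G_bar = max(np.linalg.norm(grad_g(U @ U.T))
            for U in (sample_ball(3) for _ in range(2000)))
L_gtilde = 2 * G_bar + 36 * L_g  # the constant of Lemma C.1

ratio = 0.0
for _ in range(1000):
    U1, U2 = sample_ball(3), sample_ball(3)
    den = np.linalg.norm(U1 - U2)
    if den > 1e-12:
        num = np.linalg.norm(grad_g_tilde(U1) - grad_g_tilde(U2))
        ratio = max(ratio, num / den)
print(f"empirical gradient ratio {ratio:.2f} <= 2*G_bar + 36*L_g = {L_gtilde:.2f}")
```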

The following lemma establishes key properties about each ADAP-FISTA call made in step 1 of ADAP-AIPP. It is a translation of the results in Proposition B.2.

Lemma C.2.

Let (ψs,ψn)(\psi_{s},\psi_{n}) be as in (20). The following statements about each ADAP-FISTA call made in the jj-th iteration of ADAP-AIPP hold:

  • (a)

    if M¯j=𝒪(1+λ0Lg~)\underline{M}_{j}={\cal O}(1+\lambda_{0}L_{\tilde{g}}), it always stops (with either success or failure) in at most

    𝒪1(2(1+λ0Lg~)log1+(1+λ0Lg~))\mathcal{O}_{1}\left(\sqrt{2\left(1+\lambda_{0}L_{\tilde{g}}\right)}\log^{+}_{1}(1+\lambda_{0}L_{\tilde{g}})\right)

    iterations/resolvent evaluations where λ0\lambda_{0} is the initial prox stepsize and Lg~L_{\tilde{g}} is as in (25);

  • (b)

    if ADAP-FISTA stops successfully, it terminates with a triple (W,V,L)B¯1×𝕊n×++(W,V,L)\in\bar{B}_{1}\times\mathbb{S}^{n}\times{\mathbb{R}}_{++} satisfying

    Vλ[g~(W)+δB¯1(W)]+(WWj1)V\in\lambda\left[\nabla\tilde{g}(W)+\partial\delta_{\bar{B}_{1}}(W)\right]+(W-W_{j-1}) (111)
    VFσWWj1F,Lmax{M¯j,ω(1+λ0Lg~)}\|V\|_{F}\leq\sigma\|W-W_{j-1}\|_{F},\quad L\leq\max\{\underline{M}_{j},\omega(1+\lambda_{0}L_{\tilde{g}})\} (112)

    where ω=4/(1χ)\omega=4/(1-\chi);

  • (c)

    if ψs\psi_{s} is 1/21/2-convex on B¯1\bar{B}_{1}, then ADAP-FISTA always terminates with success and its output (W,V,L)(W,V,L) always satisfies relation (21).

Proof.

(a) The result follows directly from Proposition B.2 in Appendix B and hence the proof relies on verifying its assumptions. First it is easy to see that domψn+B¯2=B¯3\mathrm{dom}\,\psi_{n}+\bar{B}_{2}=\bar{B}_{3}, where ψn\psi_{n} is as in (20). It also follows immediately from the fact that λλ0\lambda\leq\lambda_{0} and from Lemma C.1 that ψs\psi_{s} is (1+λ0Lg~)(1+\lambda_{0}L_{\tilde{g}})-smooth on B¯3\bar{B}_{3} in view of its definition in (20). These two observations imply that L¯=1+λ0Lg~\bar{L}=1+\lambda_{0}L_{\tilde{g}} satisfies (96). Hence, it follows from this conclusion, the fact that each ADAP-FISTA call is made with (μ,L0)=(1/2,M¯j)(\mu,L_{0})=(1/2,\underline{M}_{j}), and from Proposition B.2(a) that statement (a) holds.

(b) In view of the definition of ψs\psi_{s} in (20), it is easy to see that ψs(W)=λg~(W)+(WWj1)\nabla\psi_{s}(W)=\lambda\nabla\tilde{g}(W)+(W-W_{j-1}). Statement (b) then immediately follows from this observation, the fact that each ADAP-FISTA call is made with inputs x0=Wj1x_{0}=W_{j-1}, L0=M¯jL_{0}=\underline{M}_{j}, and (ψs,ψn)(\psi_{s},\psi_{n}) as in (20), and from Proposition B.2(b) with L¯=1+λ0Lg~\bar{L}=1+\lambda_{0}L_{\tilde{g}}.

(c) It follows immediately from Proposition B.2(c) and the fact that each ADAP-FISTA call is made with inputs (ψs,ψn)(\psi_{s},\psi_{n}) as in (20) and μ=1/2\mu=1/2 that the first conclusion of statement (c) holds and that output (W,V,L)(W,V,L) satisfies inclusion

V(λ(g~+δB¯1)()+12Wj1F2)(W).V\in\partial\left(\lambda\left(\tilde{g}+\delta_{\bar{B}_{1}}\right)(\cdot)+\frac{1}{2}\|\cdot-W_{j-1}\|^{2}_{F}\right)\left(W\right). (113)

Inclusion (113) and the definition of subdifferential in (5) then immediately imply that output (W,V,L)(W,V,L) satisfies relation (21). ∎

The lemma below shows that, in every iteration of ADAP-AIPP, the loop within steps 1 and 2 always stops and shows key properties of its output. Its proof (included here for completeness) closely follows the one of Proposition 3.1 of [56].

Lemma C.3.

The following statements about ADAP-AIPP hold for every j1j\geq 1:

  • (a)

    the function ψs\psi_{s} in (20) has (1+λ0Lg~)(1+\lambda_{0}L_{\tilde{g}})-Lipschitz continuous gradient on B¯3\bar{B}_{3};

  • (b)

    the loop within steps 1 and 2 of its jj-th iteration always ends and the output (Wj,Vj,Rj,λj,M¯j)(W_{j},V_{j},R_{j},{\lambda}_{j},\bar{M}_{j}) obtained at the end of step 2 satisfies

    Rjg~(Wj)+δB¯1(Wj);\displaystyle R_{j}\in\nabla\tilde{g}(W_{j})+\partial\delta_{\bar{B}_{1}}(W_{j}); (114)
    (1σσ)VjFλjRjF(1+σ)WjWj1F;\displaystyle\left(\frac{1-\sigma}{\sigma}\right)\|V_{j}\|_{F}\leq\|\lambda_{j}R_{j}\|_{F}\leq(1+\sigma)\|W_{j}-W_{j-1}\|_{F}; (115)
λjg~(Wj1)[λjg~(Wj)+12WjWj1F2]Vj(Wj1Wj);\displaystyle\lambda_{j}\tilde{g}(W_{j-1})-\left[\lambda_{j}\tilde{g}(W_{j})+\frac{1}{2}\|W_{j}-W_{j-1}\|_{F}^{2}\right]\geq V_{j}\bullet(W_{j-1}-W_{j}); (116)
M¯jmax{M¯j,ω(1+λ0Lg~)};\displaystyle\bar{M}_{j}\leq\max\{\underline{M}_{j},\omega(1+\lambda_{0}L_{\tilde{g}})\}; (117)
    λ0λjλ¯,\displaystyle\lambda_{0}\geq\lambda_{j}\geq\underline{\lambda}, (118)

    where ω=4/(1χ)\omega=4/(1-\chi), λ0\lambda_{0} is the initial prox stepsize, and Lg~L_{\tilde{g}} and λ¯\underline{\lambda} are as in (25) and (26), respectively; moreover, every prox stepsize λ{\lambda} generated within the aforementioned loop is in [λ¯,λ0][\underline{{\lambda}},\lambda_{0}].

Proof.

(a) It follows that ψs\psi_{s} has (1+λLg~)(1+\lambda L_{\tilde{g}})-Lipschitz continuous gradient on B¯3\bar{B}_{3} in view of its definition in (20) and Lemma C.1. The result then follows immediately from the fact that λλ0\lambda\leq\lambda_{0}.

(b) We first claim that if the loop consisting of steps 1 and 2 of the jj-th iteration of ADAP-AIPP stops, then relations (114), (115), (116), and (117) hold. Indeed, assume that the loop consisting of steps 1 and 2 of the jj-th iteration of ADAP-AIPP stops. It then follows from the logic within steps 1 and 2 of ADAP-AIPP that the last ADAP-FISTA call within the loop stops successfully and outputs a triple (Wj,Vj,M¯j)(W_{j},V_{j},\bar{M}_{j}) satisfying (21), which immediately implies that (116) holds. Since (a) implies that L¯=1+λ0Lg~\bar{L}=1+\lambda_{0}L_{\tilde{g}} satisfies relation (96), it follows from Proposition B.2(b) with (ψs,ψn)(\psi_{s},\psi_{n}) as in (20), x0=Wj1x_{0}=W_{j-1}, and L0=M¯jL_{0}=\underline{M}_{j} that the triple (Wj,Vj,M¯j)=(y,v,L)(W_{j},V_{j},\bar{M}_{j})=(y,v,L) satisfies inequality (117) and the following two relations

Vjλj[g~(Wj)+δB¯1(Wj)]+WjWj1\displaystyle V_{j}\in\lambda_{j}[\nabla\tilde{g}(W_{j})+\partial\delta_{\bar{B}_{1}}(W_{j})]+W_{j}-W_{j-1} (119)
VjFσWjWj1F.\displaystyle\|V_{j}\|_{F}\leq\sigma\|W_{j}-W_{j-1}\|_{F}. (120)

Now, using the definition of RjR_{j} in (22), it is easy to see that the inclusion (119) is equivalent to (114) and that the inequality in (120) together with the triangle inequality for norms imply the two inequalities in (115).

We now claim that if step 1 is performed with a prox stepsize λ1/(2Lg~)\lambda\leq 1/(2L_{\tilde{g}}) in the jj-th iteration, then for every l>jl>j, we have that λl1=λ{\lambda}_{l-1}={\lambda} and the ll-th iteration performs step 1 only once. To show the claim, assume that λ1/(2Lg~)\lambda\leq 1/(2L_{\tilde{g}}). Using this assumption and the fact that Lemma C.1 implies that g~\tilde{g} is Lg~L_{\tilde{g}}-weakly convex on B¯3\bar{B}_{3}, it is easy to see that the function ψs\psi_{s} in (20) is strongly convex on B¯1B¯3\bar{B}_{1}\subseteq\bar{B}_{3} with modulus 1λLg~1/21-{\lambda}L_{\tilde{g}}\geq 1/2. Since each ADAP-FISTA call is performed in step 1 of ADAP-AIPP with μ=1/2\mu=1/2, it follows immediately from Proposition B.2(c) with (ψs,ψn)(\psi_{s},\psi_{n}) as in (20) that ADAP-FISTA terminates successfully and outputs a pair (W,V)(W,V) satisfying V(ψs+ψn)(W)V\in\partial(\psi_{s}+\psi_{n})(W). This inclusion, the definitions of (ψs,ψn)(\psi_{s},\psi_{n}), and the definition of subdifferential in (5), then imply that (21) holds. Hence, in view of the termination criterion of step 2 of ADAP-AIPP, it follows that λj=λ\lambda_{j}=\lambda. It is then easy to see, by the way λ\lambda is updated in step 2 of ADAP-AIPP, that λ\lambda is not halved in the (j+1)(j+1)-th iteration or any subsequent iteration, hence proving the claim.

It is now straightforward to see that the above two claims, the fact that the initial value of the prox stepsize is equal to λ0\lambda_{0}, and the way λj{\lambda}_{j} is updated in ADAP-AIPP, imply that the lemma holds. ∎

Lemma C.4.

For any j1j\geq 1, the quantity M¯j\underline{M}_{j} satisfies

M¯jω(1+λ0Lg~)\underline{M}_{j}\leq\omega(1+\lambda_{0}L_{\tilde{g}}) (121)

where ω=4/(1χ)\omega=4/(1-\chi), λ0\lambda_{0} is the initial prox stepsize, and Lg~L_{\tilde{g}} is as in (25).

Proof.

The result follows from a simple induction argument. The inequality with j=1j=1 is immediate due to the facts that M¯1=1\underline{M}_{1}=1 and ω>1\omega>1. Now suppose inequality (121) holds for j1j-1. It then follows from relation (117) and the fact that M¯jM¯j1\underline{M}_{j}\leq\bar{M}_{j-1} that

M¯jM¯j1(117)max{M¯j1,ω(1+λ0Lg~)}=ω(1+λ0Lg~),\underline{M}_{j}\leq\bar{M}_{j-1}\overset{\eqref{upper L bound}}{\leq}\max\{\underline{M}_{j-1},\omega(1+\lambda_{0}L_{\tilde{g}})\}=\omega(1+\lambda_{0}L_{\tilde{g}}),

where the equality is due to the assumption that (121) holds for j1j-1. Hence, Lemma C.4 holds. ∎

Remark C.5.

It follows from Lemma C.4 that M¯j=𝒪(1+λ0Lg~)\underline{M}_{j}={\cal O}(1+\lambda_{0}L_{\tilde{g}}) and hence Lemma C.2(a) implies that each ADAP-FISTA call made in step 1 of ADAP-AIPP performs at most

𝒪1(2(1+λ0Lg~)log1+(1+λ0Lg~))\mathcal{O}_{1}\left(\sqrt{2\left(1+\lambda_{0}L_{\tilde{g}}\right)}\log^{+}_{1}(1+\lambda_{0}L_{\tilde{g}})\right)

iterations/resolvent evaluations where λ0\lambda_{0} is the initial prox stepsize and Lg~L_{\tilde{g}} is as in (25).

The following lemma shows that ADAP-AIPP is a descent method.

Lemma C.6.

If jj is an iteration index for ADAP-AIPP, then

λ¯CσRjF2\displaystyle\frac{\underline{\lambda}}{C_{\sigma}}\|R_{j}\|_{F}^{2} g~(Wj1)g~(Wj)\displaystyle\leq\tilde{g}(W_{j-1})-\tilde{g}(W_{j}) (122)

where CσC_{\sigma} and λ¯\underline{\lambda} are as in (26).

Proof.

It follows immediately from the first inequality in (115), relation (116), and the definitions of RjR_{j} and CσC_{\sigma} in (22) and (26), respectively, that:

λjg~(Wj1)λjg~(Wj)\displaystyle\lambda_{j}\tilde{g}(W_{j-1})-\lambda_{j}\tilde{g}(W_{j}) (116)12WjWj1F2+Vj(Wj1Wj)\displaystyle\overset{\eqref{subdiff zk}}{\geq}\frac{1}{2}\|W_{j}-W_{j-1}\|_{F}^{2}+V_{j}\bullet(W_{j-1}-W_{j})
=12Wj1Wj+VjF212VjF2\displaystyle=\frac{1}{2}\|W_{j-1}-W_{j}+V_{j}\|_{F}^{2}-\frac{1}{2}\|V_{j}\|_{F}^{2}
=(22)12λjRjF212VjF2\displaystyle\overset{\eqref{Rj def}}{=}\frac{1}{2}\|\lambda_{j}R_{j}\|_{F}^{2}-\frac{1}{2}\|V_{j}\|_{F}^{2}
(115)12λjRjF2σ22(1σ)2λjRjF2\displaystyle\overset{\eqref{w inequalities}}{\geq}\frac{1}{2}\|\lambda_{j}R_{j}\|_{F}^{2}-\frac{\sigma^{2}}{2(1-\sigma)^{2}}\|\lambda_{j}R_{j}\|_{F}^{2}
=12σ2(1σ)2λjRjF2=(26)λjRjF2Cσ.\displaystyle=\frac{1-2\sigma}{2(1-\sigma)^{2}}\|\lambda_{j}R_{j}\|_{F}^{2}\overset{\eqref{phi* alpha0 quantities-2}}{=}\frac{\|\lambda_{j}R_{j}\|_{F}^{2}}{C_{\sigma}}. (123)

Dividing inequality (123) by λj\lambda_{j} and using relation (118) then implies

g~(Wj1)g~(Wj)(123)λjCσRjF2(118)λ¯CσRjF2\displaystyle\tilde{g}(W_{j-1})-\tilde{g}(W_{j})\overset{\eqref{string of inequalities}}{\geq}\frac{\lambda_{j}}{C_{\sigma}}\|R_{j}\|_{F}^{2}\overset{\eqref{lower lambda bound}}{\geq}\frac{\underline{\lambda}}{C_{\sigma}}\|R_{j}\|_{F}^{2}

from which the result of the lemma immediately follows. ∎

We are now ready to give the proof of Proposition 2.4.

Proof of Proposition 2.4.

(a) The first statement of (a) follows immediately from the fact that relation (114) and the termination criterion of ADAP-AIPP in its step 3 imply that the pair (W¯,R¯)(\overline{W},\overline{R}) satisfies (24). Assume now by contradiction that (27) does not hold. This implies that there exists an iteration index ll such that

l>1+Cσλ¯ρ¯2[g~(W¯)g~(Wl)].l>1+\frac{C_{\sigma}}{\underline{\lambda}\bar{\rho}^{2}}\left[\tilde{g}(\underline{W})-\tilde{g}(W_{l})\right]. (124)

As ADAP-AIPP generates ll as an iteration index, it does not terminate at the (l1)(l-1)-th iteration. In view of step 3 of ADAP-AIPP, this implies that RjF>ρ¯\|R_{j}\|_{F}>\bar{\rho} for every j=1,,l1j=1,\ldots,l-1. Using this conclusion and the fact that W0=W¯W_{0}=\underline{W}, and summing inequality (122) from 11 to ll, we conclude that

(l1)ρ¯2<j=1l1RjF2j=1lRjF2(122)Cσλ¯j=1l[g~(Wj1)g~(Wj)]=Cσλ¯[g~(W¯)g~(Wl)]\displaystyle(l-1)\bar{\rho}^{2}<\sum_{j=1}^{l-1}\|R_{j}\|_{F}^{2}\leq\sum_{j=1}^{l}\|R_{j}\|_{F}^{2}\overset{\eqref{auxineq:declemma3}}{\leq}\frac{C_{\sigma}}{\underline{\lambda}}\sum_{j=1}^{l}\left[\tilde{g}(W_{j-1})-\tilde{g}(W_{j})\right]=\frac{C_{\sigma}}{\underline{\lambda}}\left[\tilde{g}(\underline{W})-\tilde{g}(W_{l})\right]

which can be easily seen to contradict (124).

(b) The result follows immediately from (a) and the fact that the number of times λ{\lambda} is divided by 22 in step 2 of ADAP-AIPP is at most log0+(λ0/λ¯)/log2\lceil\log_{0}^{+}(\lambda_{0}/{\underline{{\lambda}}})/\log 2\rceil. ∎
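For readers who prefer pseudocode, the sketch below assembles the ADAP-AIPP loop whose properties were just established: the ADAP-FISTA calls of step 1, the halving of the prox stepsize in step 2, the residual (22), and the stopping rule of step 3. The callable adap_fista is an assumed blackbox behaving as in Lemma C.2; its signature, the warm-start rule, and all names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def adap_aipp(adap_fista, W0, lam0, rho_bar, max_iter=1000):
    """Sketch of the ADAP-AIPP loop analyzed in this section (steps 1-3).
    `adap_fista` is an assumed blackbox that, given the prox center W_prev,
    stepsize lam, and initial curvature estimate L0, attempts to solve
    min { lam*g_tilde(W) + 0.5*||W - W_prev||_F^2 : ||W||_F <= 1 } and
    returns (W, V, L, success), where success means it stopped successfully
    with an output satisfying (21), as in Lemma C.2."""
    W, lam, M_under = W0, lam0, 1.0
    R = None
    for _ in range(max_iter):
        # Steps 1-2: call ADAP-FISTA with mu = 1/2, halving the prox
        # stepsize on failure; Lemma C.3(b) guarantees this inner loop
        # ends with lam staying in [lam_under, lam0].
        while True:
            W_new, V, L, success = adap_fista(W_prev=W, lam=lam,
                                              L0=M_under, mu=0.5)
            if success:
                break
            lam /= 2.0
        M_under = L  # warm start for the next call (M_under_{j+1} <= M_bar_j)
        # Residual from (22): lam_j * R_j = V_j + W_{j-1} - W_j.
        R = (V + W - W_new) / lam
        W = W_new
        # Step 3: stop once ||R_j||_F <= rho_bar, yielding the pair in (24).
        if np.linalg.norm(R) <= rho_bar:
            break
    return W, R
```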

Appendix D Relaxed Frank-Wolfe Method

Let 𝔼\mathbb{E} denote a finite-dimensional inner product real vector space with inner product and induced norm denoted by ,\langle\cdot,\cdot\rangle and \|\cdot\|, respectively. Let Ω𝔼\Omega\subseteq\mathbb{E} be a nonempty compact convex set with diameter DΩD_{\Omega}. Consider the problem

(P)g:=minU{g(U):UΩ}(P)\quad g_{*}:=\min_{U}\{g(U):U\in\Omega\} (125)

where g:𝔼g:\mathbb{E}\rightarrow\mathbb{R} is a convex function that satisfies the following assumption:

  • (A1)

    there exists Lg>0L_{g}>0 such that

    g(U)g(U;U)Lg2UU2U,UΩ.g(U^{\prime})-\ell_{g}(U^{\prime};U)\leq\frac{L_{g}}{2}\|U^{\prime}-U\|^{2}\quad\forall U,U^{\prime}\in\Omega. (126)

The formal description of the Relaxed FW (RFW) method and its main complexity result for finding a near-optimal solution of (125) are presented below. The proof of the main result is given in the next subsection.

 

RFW Method

 

Input: tolerance ϵ¯>0\bar{\epsilon}>0 and initial point Z~0Ω\tilde{Z}_{0}\in\Omega.

Output: a point Z¯\bar{Z}.

  • 0.

    set k=1k=1;

  • 1.

    find a point ZkΩZ_{k}\in\Omega such that

    g(Zk)g(Z~k1);g(Z_{k})\leq g(\tilde{Z}_{k-1}); (127)
  • 2.

    compute

    ZkFargmin{g(U;Zk):UΩ},Dk:=ZkZkF,ϵk:=g(Zk),Dk;Z^{F}_{k}\in\operatorname*{arg\,min}\{\ell_{g}(U;Z_{k}):U\in\Omega\},\quad D_{k}:=Z_{k}-Z^{F}_{k},\quad\epsilon_{k}:=\langle\nabla g(Z_{k}),D_{k}\rangle; (128)
  • 3.

    if ϵkϵ¯\epsilon_{k}\leq\bar{\epsilon}, then stop and output the point Z¯=Zk\bar{Z}=Z_{k}; else compute

αk=argminα{g(ZkαDk):α[0,1]}\alpha_{k}=\operatorname*{arg\,min}_{\alpha}\left\{g(Z_{k}-\alpha D_{k}):\alpha\in[0,1]\right\} (129)

    and set

    Z~k=ZkαkDk;\tilde{Z}_{k}=Z_{k}-\alpha_{k}D_{k}; (130)
  • 4.

    set kk+1k\leftarrow k+1 and go to step 1.

 
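A minimal Python sketch of the RFW method above is given next; it takes the simplest admissible choice Zk=Z̃k−1 in step 1 (which trivially satisfies (127)) and replaces the exact line search (129) by a grid search for illustration. The oracle lmo and all names are illustrative assumptions.

```python
import numpy as np

def rfw(g, grad_g, lmo, Z0, eps_bar, max_iter=10000):
    """Sketch of the RFW method. `lmo(G)` returns a minimizer of <G, U>
    over Omega (the linear minimization oracle behind (128)); g and
    grad_g evaluate the objective and its gradient."""
    Z_tilde = Z0
    for _ in range(max_iter):
        Z = Z_tilde                    # step 1: g(Z_k) <= g(Z_tilde_{k-1})
        G = grad_g(Z)
        Z_F = lmo(G)                   # step 2: Frank-Wolfe vertex
        D = Z - Z_F
        eps_k = float(np.vdot(G, D))   # duality gap epsilon_k in (128)
        if eps_k <= eps_bar:           # step 3: stop at a small gap
            return Z
        # Grid-search stand-in for the exact line search (129).
        alpha = min(np.linspace(0.0, 1.0, 101), key=lambda a: g(Z - a * D))
        Z_tilde = Z - alpha * D        # update (130)
    return Z
```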

Theorem D.1.

For a given tolerance ϵ¯>0\bar{\epsilon}>0, the RFW method finds a point Z¯Ω\bar{Z}\in\Omega such that

0g(Z¯)+ϵ¯δΩ(Z¯)0\in\nabla g(\bar{Z})+\partial_{\bar{\epsilon}}\delta_{\Omega}(\bar{Z}) (131)

in at most

1+4max{g(Z~0)g,(g(Z~0)g)LgDΩ2,LgDΩ2}ϵ¯\left\lceil 1+\frac{4\max\left\{g(\tilde{Z}_{0})-g_{*},\sqrt{\left(g(\tilde{Z}_{0})-g_{*}\right)L_{g}D^{2}_{\Omega}},L_{g}D^{2}_{\Omega}\right\}}{\bar{\epsilon}}\right\rceil (132)

iterations where LgL_{g} is as in (126) and DΩD_{\Omega} is the diameter of Ω\Omega.
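For a sense of scale, take the illustrative (hypothetical) values $g(\tilde{Z}_{0})-g_{*}=10$, $L_{g}D^{2}_{\Omega}=4$, and $\bar{\epsilon}=10^{-2}$. Since $\max\{10,\sqrt{10\cdot 4},4\}=10$, the bound (132) evaluates to

$\left\lceil 1+\frac{4\cdot 10}{10^{-2}}\right\rceil=4001$

iterations, making the $\mathcal{O}(1/\bar{\epsilon})$ dependence on the tolerance explicit.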

D.1   Proof of Theorem D.1

This subsection is dedicated to proving Theorem D.1.

The following lemma establishes important properties of the iterates ZkZ_{k} and ϵk\epsilon_{k}.

Lemma D.2.

For every k1k\geq 1, the following relations hold:

ϵkg(Zk)g,\epsilon_{k}\geq g(Z_{k})-g_{*}, (133)
0g(Zk)+ϵkδΩ(Zk),0\in\nabla g(Z_{k})+\partial_{\epsilon_{k}}\delta_{\Omega}(Z_{k}), (134)

where ϵk\epsilon_{k} is defined in (128) and gg_{*} is the optimal value of (125).

Proof.

Suppose ZΩZ^{\prime}\in\Omega. It follows from the fact that gg is convex, the definitions of DkD_{k} and ϵk\epsilon_{k} in (128), and the way ZkFZ^{F}_{k} is computed in (128) that

g(Zk)ϵk\displaystyle g(Z_{k})-\epsilon_{k} =(128)g(Zk)+g(Zk),ZkFZk\displaystyle\overset{\eqref{FW algo definitions}}{=}g(Z_{k})+\langle\nabla g(Z_{k}),Z^{F}_{k}-Z_{k}\rangle
(128)g(Zk)+g(Zk),ZZkg(Z).\displaystyle\overset{\eqref{FW algo definitions}}{\leq}g(Z_{k})+\langle\nabla g(Z_{k}),Z^{\prime}-Z_{k}\rangle\leq g(Z^{\prime}). (135)

Since (135) holds for any ZZ^{\prime} in Ω\Omega, it must hold for the minimizer ZZ^{*} of (125), and hence g(Zk)ϵkgg(Z_{k})-\epsilon_{k}\leq g_{*}, which immediately shows relation (133).

It follows from the fact that ZkFargmin{g(U;Zk):UΩ}Z^{F}_{k}\in\operatorname*{arg\,min}\{\ell_{g}(U;Z_{k}):U\in\Omega\} that

0g(Zk)+δΩ(ZkF).0\in\nabla g(Z_{k})+\partial\delta_{\Omega}(Z^{F}_{k}).

It then follows from the above relation, the definition of ϵ\epsilon-subdifferential in (5), and the definition of ϵk\epsilon_{k} in (128) that inclusion (134) holds. ∎

The following lemma establishes that the RFW method is a descent method.

Lemma D.3.

Define

α^k:=min{1,ϵkLgDΩ2}k1.\hat{\alpha}_{k}:=\min\left\{1,\frac{\epsilon_{k}}{L_{g}D^{2}_{\Omega}}\right\}\quad\forall k\geq 1. (136)

Then the following statements hold for every k1k\geq 1:

ϵk2α^k(g(Zk)g(Z~k)),\epsilon_{k}\leq\frac{2}{\hat{\alpha}_{k}}\left(g(Z_{k})-g(\tilde{Z}_{k})\right), (137)
g(Zk+1)g(Z~k)g(Zk).g(Z_{k+1})\leq g(\tilde{Z}_{k})\leq g(Z_{k}). (138)
Proof.

It follows from the definitions of ϵk\epsilon_{k}, αk\alpha_{k}, Z~k\tilde{Z}_{k}, and α^k\hat{\alpha}_{k} in (128), (129), (130), (136), respectively, the fact that Dk2DΩ2\|D_{k}\|^{2}\leq D^{2}_{\Omega}, and from applying inequality (126) with U=Zkα^kDkU^{\prime}=Z_{k}-\hat{\alpha}_{k}D_{k} and U=ZkU=Z_{k} that

g(Z~k)\displaystyle g(\tilde{Z}_{k}) (129),(130)g(Zkα^kDk)(126)g(Zk)α^kg(Zk),Dk+α^k2Lg2DΩ2\displaystyle\overset{\eqref{FW stepsize},\eqref{Tilde X Update}}{\leq}g(Z_{k}-\hat{\alpha}_{k}D_{k})\overset{\eqref{upper curvature}}{\leq}g(Z_{k})-\hat{\alpha}_{k}\langle\nabla g(Z_{k}),D_{k}\rangle+\frac{\hat{\alpha}_{k}^{2}L_{g}}{2}D^{2}_{\Omega}
=(128)g(Zk)α^kϵk+α^k2Lg2DΩ2(136)g(Zk)α^kϵk+α^kϵk2=g(Zk)α^kϵk2\displaystyle\overset{\eqref{FW algo definitions}}{=}g(Z_{k})-\hat{\alpha}_{k}\epsilon_{k}+\frac{\hat{\alpha}_{k}^{2}L_{g}}{2}D^{2}_{\Omega}\overset{\eqref{Definition of Hat Alpha}}{\leq}g(Z_{k})-\hat{\alpha}_{k}\epsilon_{k}+\frac{\hat{\alpha}_{k}\epsilon_{k}}{2}=g(Z_{k})-\frac{\hat{\alpha}_{k}\epsilon_{k}}{2}

which immediately implies relation (137).

The first inequality in (138) follows immediately from (127). The second inequality in (138) follows immediately from relations (137) and (133). ∎

The next proposition establishes the convergence rate of the RFW method.

Proposition D.4.

For every k2k\geq 2, the following relations hold:

g(Zk)g\displaystyle g(Z_{k})-g_{*} 2k1max{g(Z~0)g,LgDΩ2},\displaystyle\leq\frac{2}{k-1}\max\left\{g(\tilde{Z}_{0})-g_{*},L_{g}D^{2}_{\Omega}\right\}, (139)
minkj2kϵj\displaystyle\min_{k\leq j\leq 2k}\epsilon_{j} 4k1max{g(Z~0)g,(g(Z~0)g)LgDΩ2,LgDΩ2}.\displaystyle\leq\frac{4}{k-1}\max\left\{g(\tilde{Z}_{0})-g_{*},\sqrt{\left(g(\tilde{Z}_{0})-g_{*}\right)L_{g}D^{2}_{\Omega}},L_{g}D^{2}_{\Omega}\right\}. (140)
Proof.

Define γ~0:=g(Z~0)g\tilde{\gamma}_{0}:=g(\tilde{Z}_{0})-g_{*} and let γj:=g(Zj)g\gamma_{j}:=g(Z_{j})-g_{*} for any iteration index jj. It then follows from relations (133) and (137) and relation (138) with k=jk=j that the following two relations hold

α^j2γj(133)α^j2ϵj(137)g(Zj)g(Z~j)(138)g(Zj)g(Zj+1)=γjγj+1\displaystyle\frac{\hat{\alpha}_{j}}{2}\gamma_{j}\overset{\eqref{epsilon nonnegative}}{\leq}\frac{\hat{\alpha}_{j}}{2}\epsilon_{j}\overset{\eqref{key convergence relation}}{\leq}g(Z_{j})-g(\tilde{Z}_{j})\overset{\eqref{descent tildes}}{\leq}g(Z_{j})-g(Z_{j+1})=\gamma_{j}-\gamma_{j+1} (141)
γj+1(138)γj(133)ϵj.\displaystyle\gamma_{j+1}\overset{\eqref{descent tildes}}{\leq}\gamma_{j}\overset{\eqref{epsilon nonnegative}}{\leq}\epsilon_{j}. (142)

Hence, using relations (141) and (142), relation (127) with k=1k=1, and the expression for α^j\hat{\alpha}_{j} in (136), it follows that

1γj+11γj\displaystyle\frac{1}{\gamma_{j+1}}-\frac{1}{\gamma_{j}} =γjγj+1γj+1γj(141)α^jγj2γj+1γj=(136)12γj+1min{1,ϵjLgDΩ2}\displaystyle=\frac{\gamma_{j}-\gamma_{j+1}}{\gamma_{j+1}\gamma_{j}}\overset{\eqref{Recurrence Relation-1}}{\geq}\frac{\hat{\alpha}_{j}\gamma_{j}}{2\gamma_{j+1}\gamma_{j}}\overset{\eqref{Definition of Hat Alpha}}{=}\frac{1}{2\gamma_{j+1}}\min\left\{1,\frac{\epsilon_{j}}{L_{g}D^{2}_{\Omega}}\right\} (143)
(142)12min{1γ1,1LgDΩ2}(127)12min{1γ~0,1LgDΩ2}.\displaystyle\overset{\eqref{Recurrence Relation-2}}{\geq}\frac{1}{2}\min\left\{\frac{1}{{\gamma_{1}}},\frac{1}{L_{g}D^{2}_{\Omega}}\right\}\overset{\eqref{descent condition}}{\geq}\frac{1}{2}\min\left\{\frac{1}{\tilde{\gamma}_{0}},\frac{1}{L_{g}D^{2}_{\Omega}}\right\}. (144)

It follows from summing the inequality in (143) from j=1j=1 to k1k-1 that

1γk1γk1γ1k12min{1γ~0,1LgDΩ2},\frac{1}{\gamma_{k}}\geq\frac{1}{\gamma_{k}}-\frac{1}{\gamma_{1}}\geq\frac{k-1}{2}\min\left\{\frac{1}{\tilde{\gamma}_{0}},\frac{1}{L_{g}D^{2}_{\Omega}}\right\}, (145)

which, together with the definition of γk\gamma_{k}, implies relation (139).

It follows from summing the relation in (137) from j=kj=k to j=2kj=2k and relations (138) and (145) that

2k1max{γ~0,LgDΩ2}\displaystyle\frac{2}{k-1}\max\{\tilde{\gamma}_{0},L_{g}D^{2}_{\Omega}\} (145)γkg(Zk)g(Z2k+1)\displaystyle\overset{\eqref{intermediary function value}}{\geq}\gamma_{k}\geq g(Z_{k})-g(Z_{2k+1})
=j=k2k[g(Zj)g(Zj+1)](138)j=k2k[g(Zj)g(Z~j)](137)j=k2kα^j2ϵj.\displaystyle=\sum_{j=k}^{2k}\left[g(Z_{j})-g(Z_{j+1})\right]\overset{\eqref{descent tildes}}{\geq}\sum_{j=k}^{2k}\left[g(Z_{j})-g(\tilde{Z}_{j})\right]\overset{\eqref{key convergence relation}}{\geq}\sum_{j=k}^{2k}\frac{\hat{\alpha}_{j}}{2}\epsilon_{j}. (146)

It now follows from relation (146) and the definition of α^j\hat{\alpha}_{j} in (136) that

4(k1)2max{γ~0,LgDΩ2}(146)minkj2kα^jϵj(136)minkj2k{1,ϵjLgDΩ2}minkj2kϵj,\displaystyle\frac{4}{(k-1)^{2}}\max\left\{\tilde{\gamma}_{0},L_{g}D^{2}_{\Omega}\right\}\overset{\eqref{Final FW inequality}}{\geq}\min_{k\leq j\leq 2k}\hat{\alpha}_{j}\epsilon_{j}\overset{\eqref{Definition of Hat Alpha}}{\geq}\min_{k\leq j\leq 2k}\left\{1,\frac{\epsilon_{j}}{L_{g}D^{2}_{\Omega}}\right\}\min_{k\leq j\leq 2k}\epsilon_{j}, (147)

which implies relation (140) in view of the definition of γ~0\tilde{\gamma}_{0}. ∎

We are now ready to prove Theorem D.1.

Proof of Theorem D.1.

The stopping criterion in step 3 of the RFW method and relation (134) immediately imply that output Z¯\bar{Z} satisfies relation (131).

In view of the stopping criterion in step 3 of the RFW method, the iteration complexity result in (132) follows immediately from relation (140). ∎

Appendix E AL method for linearly-constrained convex optimization

This section is dedicated to analyzing the convergence of the augmented Lagrangian framework for solving linearly-constrained convex optimization problems.

Let 𝔼\mathbb{E} denote a Euclidean space, 𝒜:𝔼m\mathcal{A}:\mathbb{E}\to{\mathbb{R}}^{m} be a linear operator, bmb\in{\mathbb{R}}^{m}, f:𝔼f:\mathbb{E}\to{\mathbb{R}} be a differentiable convex function, and h:𝔼(,]h:\mathbb{E}\to(-\infty,\infty] be a closed proper convex function. Consider the linearly-constrained convex optimization problem

min{ϕ(X):=f(X)+h(X):𝒜X=b},\min\{\phi(X):=f(X)+h(X):\mathcal{A}X=b\}, (148)

where the domain of hh has finite diameter DhD_{h}. The following assumption is also made.

Assumption E.1.

There exists (X,p)(X_{*},p_{*}) such that

0f(X)+h(X)+𝒜p,𝒜Xb=0\displaystyle 0\in\nabla f(X_{*})+\partial h(X_{*})+\mathcal{A}^{*}p_{*},\quad\mathcal{A}X_{*}-b=0 (149)

Given a previous dual iterate pt1p_{t-1}, the AL framework finds the next primal iterate XtX_{t} by

XtargminXβ(X;pt1)X_{t}\approx\operatorname*{arg\,min}_{X}\mathcal{L}_{\beta}(X;p_{t-1}) (150)

where

β(X;p):=f(X)+h(X)+p,𝒜Xb+β2𝒜Xb2{\cal L}_{\beta}(X;p):=f(X)+h(X)+\left\langle p,\mathcal{A}X-b\right\rangle+\frac{\beta}{2}\|\mathcal{A}X-b\|^{2} (151)

is the augmented Lagrangian function and β>0\beta>0 is a fixed penalty parameter. We assume the existence of a blackbox that inexactly solves minimization problems of the form (150).

Blackbox AL.

Given a pair (ϵ^c,ϵ^d)+2(\hat{\epsilon}_{\mathrm{c}},\hat{\epsilon}_{\mathrm{d}})\in\mathbb{R}^{2}_{+} and convex functions g:𝔼g:\mathbb{E}\to{\mathbb{R}} and h:𝔼h:\mathbb{E}\to{\mathbb{R}}, the blackbox returns a pair (X^,R^)(\hat{X},\hat{R}) satisfying

X^domh,R^g(X^)+ϵ^ch(X^),R^ϵ^d.\hat{X}\in\mathrm{dom}\,h,\quad\hat{R}\in\nabla g(\hat{X})+\partial_{\hat{\epsilon}_{\mathrm{c}}}h(\hat{X}),\qquad\|\hat{R}\|\leq\hat{\epsilon}_{\mathrm{d}}. (152)

The AL framework is now presented formally below.

 

AL Framework

 

Input: p0mp_{0}\in{\mathbb{R}}^{m}, tolerances ϵp\epsilon_{\mathrm{p}} > 0, ϵd0,ϵc0\epsilon_{\mathrm{d}}\geq 0,\epsilon_{\mathrm{c}}\geq 0, and penalty parameter β>0\beta>0.

Output: triple (X¯,p¯,R¯)(\bar{X},\bar{p},\bar{R}).

  • 0.

    Set t=1t=1 and

    ϵ^c=min{ϵc,βϵp2/6},ϵ^d=min{ϵd,βϵp2/(6Dh)};\hat{\epsilon}_{\mathrm{c}}=\min\{\epsilon_{\mathrm{c}},\;{\beta\epsilon_{\mathrm{p}}^{2}}/{6}\},\quad\hat{\epsilon}_{\mathrm{d}}=\min\{\epsilon_{\mathrm{d}},\;{\beta\epsilon_{\mathrm{p}}^{2}}/(6D_{h})\}; (153)
  • 1.

    Call the Blackbox AL with tolerance pair (ϵ^c,ϵ^d)(\hat{\epsilon}_{\mathrm{c}},\hat{\epsilon}_{\mathrm{d}}) and functions h=hh=h and g()=β(;pt1)g(\cdot)=\mathcal{L}_{\beta}(\cdot;p_{t-1}) and let (Xt,Rt)(X_{t},R_{t}) be its output;

  • 2.

    Set

    pt=pt1+β(𝒜Xtb);p_{t}=p_{t-1}+\beta(\mathcal{A}X_{t}-b); (154)
  • 3.

    If 𝒜Xtbϵp\|\mathcal{A}X_{t}-b\|\leq\epsilon_{\mathrm{p}}, then set T=tT=t and return (XT,pT,RT)(X_{T},p_{T},R_{T});

  • 4.

    Set tt+1t\leftarrow t+1 and go to step 1.

 
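As a concrete companion to the listing above, here is a minimal Python sketch of the AL framework; blackbox_al stands for the assumed Blackbox AL returning a pair (X,R) in the sense of (152), the callable A implements the linear map 𝒜, and all names are illustrative placeholders.

```python
import numpy as np

def al_framework(A, b, blackbox_al, p0, beta, eps_p, eps_d, eps_c, D_h,
                 max_iter=1000):
    """Sketch of the AL framework. `A` is the linear map (a callable) and
    `blackbox_al(p, eps_c_hat, eps_d_hat)` inexactly minimizes the
    augmented Lagrangian L_beta(.; p) in (151), returning (X, R) as in
    (152). All names are illustrative placeholders."""
    # Step 0: tightened inner tolerances (153).
    eps_c_hat = min(eps_c, beta * eps_p**2 / 6.0)
    eps_d_hat = min(eps_d, beta * eps_p**2 / (6.0 * D_h))
    p = p0
    for _ in range(max_iter):
        # Step 1: inexactly minimize L_beta(.; p_{t-1}) via the blackbox.
        X, R = blackbox_al(p, eps_c_hat, eps_d_hat)
        # Step 2: multiplier update (154).
        p = p + beta * (A(X) - b)
        # Step 3: stop once primal feasibility meets the tolerance eps_p.
        if np.linalg.norm(A(X) - b) <= eps_p:
            break
    return X, p, R
```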

The following result states the main iteration complexity of the AL framework and establishes the boundedness of its sequence of Lagrange multipliers. The proof of the result is given in the next subsection.

Theorem E.1.

Under Assumption E.1, the following statements about the AL framework hold:

  • (a)

    the AL framework terminates with an iterate (XT,pT,RT)domh×m×𝔼(X_{T},p_{T},R_{T})\in\mathrm{dom}\,h\times\mathbb{R}^{m}\times\mathbb{E} such that

    RTf(XT)+ϵch(XT)+𝒜pT,RTϵd,𝒜XTbϵp\displaystyle R_{T}\in\nabla f(X_{T})+\partial_{\epsilon_{\mathrm{c}}}h(X_{T})+\mathcal{A}^{*}p_{T},\quad\|R_{T}\|\leq\epsilon_{\mathrm{d}},\quad\|\mathcal{A}X_{T}-b\|\leq\epsilon_{\mathrm{p}} (155)

    and

    T3pp02β2ϵp2;T\leq\left\lceil\frac{3\|p_{*}-p_{0}\|^{2}}{\beta^{2}\epsilon_{\mathrm{p}}^{2}}\right\rceil; (156)
  • (b)

    there hold

maxt{0,,T}ptp+3pp02+2β(Dhϵ^d+ϵ^c),\max_{t\in\{0,\ldots,T\}}\|p_{t}\|\leq\|p_{*}\|+\sqrt{3\|p_{*}-p_{0}\|^{2}+2\beta(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}})}, (157)
    β2l=1T𝒜Xlb23pp02+2β(Dhϵ^d+ϵ^c),\beta^{2}\sum_{l=1}^{T}\|\mathcal{A}X_{l}-b\|^{2}\leq 3\|p_{*}-p_{0}\|^{2}+2\beta(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}}), (158)

where pp_{*} is an optimal Lagrange multiplier and ϵ^c\hat{\epsilon}_{\mathrm{c}} and ϵ^d\hat{\epsilon}_{\mathrm{d}} are as in (153).

E.1   Proof of Theorem E.1

This subsection is dedicated to proving Theorem E.1. The proof relies on the following two preliminary lemmas.

Lemma E.2.

For any t1t\geq 1, the following relation holds

𝒜Xtb,pptRt,XXtϵ^c,\langle\mathcal{A}X_{t}-b,p_{*}-p_{t}\rangle\geq\langle R_{t},X_{*}-X_{t}\rangle-\hat{\epsilon}_{\mathrm{c}}, (159)

where (X,p)(X_{*},p_{*}) is an optimal primal-dual pair of (148).

Proof.

Since (X,p)(X_{*},p_{*}) is an optimal primal-dual pair, it follows that

0ϕ(X)+𝒜p,𝒜X=b,0\in\partial\phi(X_{*})+\mathcal{A}^{*}p_{*},\quad\mathcal{A}X_{*}=b, (160)

where ϕ\phi is as in (148). Relation (152) implies that

Rtf(Xt)+ϵ^ch(Xt)+𝒜ptϵ^cϕ(Xt)+𝒜pt.R_{t}\in\nabla f(X_{t})+\partial_{\hat{\epsilon}_{\mathrm{c}}}h(X_{t})+\mathcal{A}^{*}p_{t}\subseteq\partial_{\hat{\epsilon}_{\mathrm{c}}}\phi(X_{t})+\mathcal{A}^{*}p_{t}. (161)

It follows from relations (160) and (161) and the definition of ϵ^c\hat{\epsilon}_{\mathrm{c}}-subdifferential that

ϕ(Xt)ϕ(X)\displaystyle\phi(X_{t})-\phi(X_{*}) (160)𝒜p,XtX\displaystyle\overset{\eqref{optimality inclusion}}{\geq}\langle-\mathcal{A}^{*}p_{*},X_{t}-X_{*}\rangle
ϕ(X)ϕ(Xt)\displaystyle\phi(X_{*})-\phi(X_{t}) (161)Rt𝒜pt,XXtϵ^c.\displaystyle\overset{\eqref{inclusion R}}{\geq}\langle R_{t}-\mathcal{A}^{*}p_{t},X_{*}-X_{t}\rangle-\hat{\epsilon}_{\mathrm{c}}.

Adding the two above relations implies that

𝒜(ppt),XtXRt,XXtϵ^c.\langle\mathcal{A}^{*}(p_{*}-p_{t}),X_{t}-X_{*}\rangle\geq\langle R_{t},X_{*}-X_{t}\rangle-\hat{\epsilon}_{\mathrm{c}}. (162)

Relations (160) and (162) then imply that

𝒜Xtb,ppt=(160)𝒜(XtX),ppt=XtX,𝒜(ppt)(162)Rt,XXtϵ^c,\langle\mathcal{A}X_{t}-b,p_{*}-p_{t}\rangle\overset{\eqref{optimality inclusion}}{=}\langle\mathcal{A}(X_{t}-X_{*}),p_{*}-p_{t}\rangle=\langle X_{t}-X_{*},\mathcal{A}^{*}(p_{*}-p_{t})\rangle\overset{\eqref{monotonicity}}{\geq}\langle R_{t},X_{*}-X_{t}\rangle-\hat{\epsilon}_{\mathrm{c}},

from which the result immediately follows. ∎

Lemma E.3.

For any iteration index tt of the AL framework, there holds:

β2l=1t𝒜Xlb2pp02ppt2+2βt(Dhϵ^d+ϵ^c).\beta^{2}\sum_{l=1}^{t}\|\mathcal{A}X_{l}-b\|^{2}\leq\|p_{*}-p_{0}\|^{2}-\|p_{*}-p_{t}\|^{2}+2\beta t(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}}). (163)
Proof.

Let tt be an iteration index of the AL framework and suppose ltl\leq t. By completing the square and using relation (154), it follows that

ppl12ppl2\displaystyle\|p_{*}-p_{l-1}\|^{2}-\|p_{*}-p_{l}\|^{2} =pl1pl2+2plpl1,ppl\displaystyle=\|p_{l-1}-p_{l}\|^{2}+2\langle p_{l}-p_{l-1},p_{*}-p_{l}\rangle
=(154)β2𝒜Xlb2+2β𝒜Xlb,ppl.\displaystyle\overset{\eqref{dual update}}{=}\beta^{2}\|\mathcal{A}X_{l}-b\|^{2}+2\beta\langle\mathcal{A}X_{l}-b,p_{*}-p_{l}\rangle. (164)

Moreover, relation (159), the definition of DhD_{h}, the Cauchy-Schwarz inequality, and the fact that the Blackbox is called in step 1 with tolerance ϵ^d\hat{\epsilon}_{d}, imply that

2β𝒜Xlb,ppl(159)2βRl,XXl2βϵ^c2βDhϵ^d2βϵ^c.\displaystyle 2\beta\langle\mathcal{A}X_{l}-b,p_{*}-p_{l}\rangle\overset{\eqref{inner product bound}}{\geq}2\beta\langle R_{l},X_{*}-X_{l}\rangle-2\beta\hat{\epsilon}_{\mathrm{c}}\geq-2\beta D_{h}\hat{\epsilon}_{\mathrm{d}}-2\beta\hat{\epsilon}_{\mathrm{c}}. (165)

Combining relations (164) and (165), we then conclude that

ppl12ppl2β2𝒜Xlb22βDhϵ^d2βϵ^c.\displaystyle\|p_{*}-p_{l-1}\|^{2}-\|p_{*}-p_{l}\|^{2}\geq\beta^{2}\|\mathcal{A}X_{l}-b\|^{2}-2\beta D_{h}\hat{\epsilon}_{\mathrm{d}}-2\beta\hat{\epsilon}_{\mathrm{c}}. (166)

The conclusion of the lemma now follows by summing relation (166) from l=1l=1 to tt. ∎

We are now ready to prove Theorem E.1.

Proof of Theorem E.1.

(a) Let tt be an iteration index of the AL framework. The fact that the Blackbox AL is called in step 1 with inputs gg and (ϵ^c,ϵ^d)(\hat{\epsilon}_{\mathrm{c}},\hat{\epsilon}_{\mathrm{d}}) implies that its output (Xt,Rt)(X_{t},R_{t}) satisfies Rtϵ^d\|R_{t}\|\leq\hat{\epsilon}_{\mathrm{d}} and also

Rtg(Xt)+ϵ^ch(Xt)\displaystyle R_{t}\in\nabla g(X_{t})+\partial_{\hat{\epsilon}_{\mathrm{c}}}h(X_{t}) =f(Xt)+𝒜(pt1+β(𝒜Xtb))+ϵ^ch(Xt)\displaystyle=\nabla f(X_{t})+\mathcal{A}^{*}(p_{t-1}+\beta(\mathcal{A}X_{t}-b))+\partial_{\hat{\epsilon}_{\mathrm{c}}}h(X_{t})
=f(Xt)+ϵ^ch(Xt)+𝒜pt.\displaystyle=\nabla f(X_{t})+\partial_{\hat{\epsilon}_{\mathrm{c}}}h(X_{t})+\mathcal{A}^{*}p_{t}.

Since ϵ^cϵc\hat{\epsilon}_{\mathrm{c}}\leq\epsilon_{\mathrm{c}} and ϵ^dϵd\hat{\epsilon}_{\mathrm{d}}\leq\epsilon_{\mathrm{d}}, it follows that

Rtf(Xt)+ϵch(Xt)+𝒜(pt),Rtϵd.R_{t}\in\nabla f(X_{t})+\partial_{\epsilon_{\mathrm{c}}}h(X_{t})+\mathcal{A}^{*}(p_{t}),\qquad\|R_{t}\|\leq\epsilon_{\mathrm{d}}.

Since the above relations hold for any iteration index tt, the output (XT,pT,RT)(X_{T},p_{T},R_{T}) of the AL framework satisfies the first two relations in (155). It remains to show that the AL framework terminates and that its last iteration index TT satisfies (156). Suppose by contradiction that the AL framework generates an iteration index t^\hat{t} satisfying

t^>3pp02β2ϵp2.\hat{t}>\left\lceil\frac{3\|p_{*}-p_{0}\|^{2}}{\beta^{2}\epsilon_{\mathrm{p}}^{2}}\right\rceil. (167)

In view of the stopping criterion of step 3 of the AL framework, this implies that 𝒜Xtb>ϵp\|\mathcal{A}X_{t}-b\|>\epsilon_{\mathrm{p}} for every t=1,,t^1t=1,\ldots,\hat{t}-1. Using this conclusion, relation (163) with t=t^1t=\hat{t}-1, and the definitions of ϵ^c\hat{\epsilon}_{\mathrm{c}} and ϵ^d\hat{\epsilon}_{\mathrm{d}} in (153), it follows that

(t^1)ϵp2\displaystyle(\hat{t}-1)\epsilon^{2}_{p} <l=1t^1𝒜Xlb2(163)pp02β2+(t^1)2β(Dhϵ^d+ϵ^c)β2\displaystyle<\sum_{l=1}^{\hat{t}-1}\|\mathcal{A}X_{l}-b\|^{2}\overset{\eqref{final inequality}}{\leq}\frac{\|p_{*}-p_{0}\|^{2}}{\beta^{2}}+(\hat{t}-1)\frac{2\beta(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}})}{\beta^{2}}
(153)pp02β2+(t^1)2ϵp23\displaystyle\overset{\eqref{epsilon hat}}{\leq}\frac{\|p_{*}-p_{0}\|^{2}}{\beta^{2}}+(\hat{t}-1)\frac{2\epsilon^{2}_{p}}{3} (168)

which clearly contradicts the bound on t^\hat{t} in (167). Hence, in view of this conclusion and the termination criterion in step 3, the AL framework must terminate with final iteration index TT satisfying (156) and output (XT,pT,RT)(X_{T},p_{T},R_{T}) satisfying the third relation in (155).

(b) Let tTt\leq T where TT is the final iteration index of the AL framework. It then follows from taking the square root of relation (163) and the triangle inequality that

pt(163)p+pp02+2βt(Dhϵ^d+ϵ^c)p+pp02+2βT(Dhϵ^d+ϵ^c).\|p_{t}\|\overset{\eqref{final inequality}}{\leq}\|p_{*}\|+\sqrt{\|p_{*}-p_{0}\|^{2}+2\beta t(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}})}\leq\|p_{*}\|+\sqrt{\|p_{*}-p_{0}\|^{2}+2\beta T(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}})}. (169)

The fact that TT satisfies relation (156) and the definitions of ϵ^c\hat{\epsilon}_{\mathrm{c}} and ϵ^d\hat{\epsilon}_{\mathrm{d}} in (153) then imply that

2βT(Dhϵ^d+ϵ^c)(156)6(Dhϵ^d+ϵ^c)βϵp2pp02+2β(Dhϵ^d+ϵ^c)(153)2pp02+2β(Dhϵ^d+ϵ^c).2\beta T(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}})\overset{\eqref{Complexity Result}}{\leq}\frac{6(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}})}{\beta\epsilon^{2}_{p}}\|p_{*}-p_{0}\|^{2}+2\beta(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}})\overset{\eqref{epsilon hat}}{\leq}2\|p_{*}-p_{0}\|^{2}+2\beta(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}}). (170)

Relation (157) then immediately follows from combining relations (169) and (170).

It follows from relation (163) with t=Tt=T that

β2l=1T𝒜Xlb2(163)pp02+2βT(Dhϵ^d+ϵ^c).\beta^{2}\sum_{l=1}^{T}\|\mathcal{A}X_{l}-b\|^{2}\overset{\eqref{final inequality}}{\leq}\|p_{*}-p_{0}\|^{2}+2\beta T(D_{h}\hat{\epsilon}_{\mathrm{d}}+\hat{\epsilon}_{\mathrm{c}}). (171)

Combining relations (170) and (171) then immediately implies inequality (158). ∎

References

  • [1] David A. Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner, editors. Graph Partitioning and Graph Clustering, 10th DIMACS Implementation Challenge Workshop, Georgia Institute of Technology, Atlanta, GA, USA, February 13-14, 2012. Proceedings, volume 588 of Contemporary Mathematics. American Mathematical Society, 2013.
  • [2] Alexander I. Barvinok. Problems of distance geometry and convex properties of quadratic maps. Discrete & Computational Geometry, 13:189–202, 1995.
  • [3] Srinadh Bhojanapalli, Nicolas Boumal, Prateek Jain, and Praneeth Netrapalli. Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form. In Conference On Learning Theory, pages 3243–3270. PMLR, 2018.
  • [4] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. Advances in Neural Information Processing Systems, 29, 2016.
  • [5] Nicolas Boumal, Vladislav Voroninski, and Afonso S Bandeira. Deterministic guarantees for Burer-Monteiro factorizations of smooth semidefinite programs. Communications on Pure and Applied Mathematics, 73(3):581–608, 2020.
  • [6] Samuel Burer and Renato DC Monteiro. A projected gradient algorithm for solving the maxcut SDP relaxation. Optimization methods and Software, 15(3-4):175–200, 2001.
  • [7] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical programming, 95(2):329–357, 2003.
  • [8] Samuel Burer and Renato DC Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical programming, 103(3):427–444, 2005.
  • [9] C. Helmberg, M.L. Overton, and F. Rendl. The spectral bundle method with second-order information. Optimization Methods and Software, 29(4):855–876, 2014.
  • [10] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
  • [11] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval from coded diffraction patterns. Applied and Computational Harmonic Analysis, 39(2):277–299, 2015.
  • [12] Emmanuel J Candes, Thomas Strohmer, and Vladislav Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274, 2013.
  • [13] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM J. Optim., 28(2):1751–1772, 2018.
  • [14] Diego Cifuentes. On the Burer–Monteiro method for general semidefinite programs. Optimization Letters, 15(6):2299–2309, 2021.
  • [15] Diego Cifuentes and Ankur Moitra. Polynomial time guarantees for the Burer-Monteiro method. Advances in Neural Information Processing Systems, 35:23923–23935, 2022.
  • [16] Andrew R Conn, Nicholas IM Gould, and Philippe Toint. A globally convergent augmented Lagrangian algorithm for optimization with general constraints and simple bounds. SIAM Journal on Numerical Analysis, 28(2):545–572, 1991.
  • [17] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1), December 2011.
  • [18] Qi Deng, Qing Feng, Wenzhi Gao, Dongdong Ge, Bo Jiang, Yuntian Jiang, Jingsong Liu, Tianhao Liu, Chenyu Xue, Yinyu Ye, et al. New developments of ADMM-based interior point methods for linear programming and conic programming. arXiv preprint arXiv:2209.01793, 2022.
  • [19] Lijun Ding and Benjamin Grimmer. Revisiting spectral bundle methods: Primal-dual (sub)linear convergence rates. SIAM Journal on Optimization, 33(2):1305–1332, 2023.
  • [20] Lijun Ding, Alp Yurtsever, Volkan Cevher, Joel A Tropp, and Madeleine Udell. An optimal-storage approach to semidefinite programming using approximate complementarity. SIAM Journal on Optimization, 31(4):2695–2725, 2021.
  • [21] Murat A Erdogdu, Asuman Ozdaglar, Pablo A Parrilo, and Nuri Denizcan Vanli. Convergence rate of block-coordinate maximization Burer–Monteiro method for solving large SDPs. Mathematical Programming, 195(1-2):243–281, 2022.
  • [22] Maryam Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.
  • [23] Robert M Freund, Paul Grigas, and Rahul Mazumder. An extended Frank–Wolfe method with “in-face” directions, and its application to low-rank matrix completion. SIAM Journal on optimization, 27(1):319–346, 2017.
  • [24] Mituhiro Fukuda, Masakazu Kojima, Kazuo Murota, and Kazuhide Nakata. Exploiting sparsity in semidefinite programming via matrix completion i: General framework. SIAM Journal on Optimization, 11(3):647–674, 2001.
  • [25] Michael Garstka, Mark Cannon, and Paul Goulart. COSMO: A conic operator splitting method for convex conic problems. Journal of Optimization Theory and Applications, 190(3):779–810, 2021.
  • [26] Robert Grone, Charles R. Johnson, Eduardo M. Sá, and Henry Wolkowicz. Positive definite completions of partial Hermitian matrices. Linear Algebra and its Applications, 58:109–124, 1984.
  • [27] Martin Grötschel, László Lovász, and Alexander Schrijver. Polynomial algorithms for perfect graphs. In North-Holland mathematics studies, volume 88, pages 325–356. Elsevier, 1984.
  • [28] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Math. Program., 152(1-2):75–112, 2015.
  • [29] Elad Hazan. Sparse approximate solutions to semidefinite programs. In Latin American symposium on theoretical informatics, pages 306–316. Springer, 2008.
  • [30] C. Helmberg and K.C. Kiwiel. A spectral bundle method with bounds. Math. Program., 93(2):173–194, 2002.
  • [31] C. Helmberg and F. Rendl. A spectral bundle method for semidefinite programming. SIAM Journal on Optimization, 10(3):673–696, 2000.
  • [32] Steven Homer and Marcus Peinado. Design and performance of parallel and distributed approximation algorithms for maxcut. Journal of Parallel and Distributed Computing, 46(1):48–61, 1997.
  • [33] Wen Huang, Kyle A Gallivan, and Xiangxiong Zhang. Solving PhaseLift by low-rank Riemannian optimization methods for complex semidefinite constraints. SIAM Journal on Scientific Computing, 39(5):B840–B859, 2017.
  • [34] Martin Jaggi and Marek Sulovský. A simple algorithm for nuclear norm regularized problems. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, page 471–478, Madison, WI, USA, 2010. Omnipress.
  • [35] Michel Journée, Francis Bach, P-A Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.
  • [36] W. Kong, J.G. Melo, and R.D.C. Monteiro. Complexity of a quadratic penalty accelerated inexact proximal point method for solving linearly constrained nonconvex composite programs. SIAM J. Optim., 29(4):2566–2593, 2019.
  • [37] W. Kong, J.G. Melo, and R.D.C. Monteiro. An efficient adaptive accelerated inexact proximal point method for solving linearly constrained nonconvex composite problems. Comput. Optim. Appl., 76(2):305–346, 2019.
  • [38] Brian Kulis, Arun C Surendran, and John C Platt. Fast low-rank semidefinite programming for embedding and clustering. In Artificial Intelligence and Statistics, pages 235–242. PMLR, 2007.
  • [39] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
  • [40] László Lovász. On the Shannon capacity of a graph. IEEE Transactions on Information theory, 25(1):1–7, 1979.
  • [41] Ramtin Madani, Abdulrahman Kalbat, and Javad Lavaei. ADMM for sparse semidefinite programming with applications to optimal power flow problem. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 5932–5939. IEEE, 2015.
  • [42] Anirudha Majumdar, Georgina Hall, and Amir Ali Ahmadi. Recent scalability improvements for semidefinite programming with applications in machine learning, control, and robotics. Annual Review of Control, Robotics, and Autonomous Systems, 3:331–360, 2020.
  • [43] Song Mei, Theodor Misiakiewicz, Andrea Montanari, and Roberto Imbuzeiro Oliveira. Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. In Conference on Learning Theory, pages 1476–1515. PMLR, 2017.
  • [44] F. Oustry. A second-order bundle method to minimize the maximum eigenvalue function. Math. Program., 89(1):1–33, 2000.
  • [45] Brendan O’Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169:1042–1068, 2016.
  • [46] C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal, and Z. Harchaoui. Catalyst for gradient-based nonconvex optimization. In AISTATS 2018-21st International Conference on Artificial Intelligence and Statistics, pages 1–10, 2018.
  • [47] Gábor Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of operations research, 23(2):339–358, 1998.
  • [48] Thomas Pumir, Samy Jelassi, and Nicolas Boumal. Smoothed analysis of the low-rank approach for smooth semidefinite programs. Advances in Neural Information Processing Systems, 31, 2018.
  • [49] Nikhil Rao, Parikshit Shah, and Stephen Wright. Conditional gradient with enhancement and truncation for atomic-norm regularization. In NIPS workshop on Greedy Algorithms. Citeseer, 2013.
  • [50] James Renegar. Accelerated first-order methods for hyperbolic programming. Mathematical Programming, 173(1-2):1–35, 2019.
  • [51] David M Rosen. Scalable low-rank semidefinite programming for certifiably correct machine perception. In Algorithmic Foundations of Robotics XIV: Proceedings of the Fourteenth Workshop on the Algorithmic Foundations of Robotics 14, pages 551–566. Springer, 2021.
  • [52] David M Rosen, Luca Carlone, Afonso S Bandeira, and John J Leonard. SE-Sync: A certifiably correct algorithm for synchronization over the special Euclidean group. The International Journal of Robotics Research, 38(2-3):95–125, 2019.
  • [53] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI, 2015.
  • [54] Alexander Shapiro. Rank-reducibility of a symmetric matrix and sampling theory of minimum trace factor analysis. Psychometrika, 47:187–199, 1982.
  • [55] Nimita Shinde, Vishnu Narayanan, and James Saunderson. Memory-efficient structured convex optimization via extreme point sampling. SIAM Journal on Mathematics of Data Science, 3(3):787–814, 2021.
  • [56] A. Sujanani and R.D.C. Monteiro. An adaptive superfast inexact proximal augmented Lagrangian method for smooth nonconvex composite optimization problems. J. Scientific Computing, 97(2), 2023.
  • [57] Lieven Vandenberghe, Martin S Andersen, et al. Chordal graphs and semidefinite optimization. Foundations and Trends® in Optimization, 1(4):241–433, 2015.
  • [58] Irene Waldspurger and Alden Waters. Rank optimality for the Burer–Monteiro factorization. SIAM journal on Optimization, 30(3):2577–2602, 2020.
  • [59] Alex L Wang and Fatma Kilinc-Karzan. Accelerated first-order methods for a class of semidefinite programs. arXiv preprint arXiv:2206.00224, 2022.
  • [60] Liuqin Yang, Defeng Sun, and Kim-Chuan Toh. SDPNAL+: a majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints. Mathematical Programming Computation, 7(3):331–366, 2015.
  • [61] Yinyu Ye. Gset dataset of random graphs. https://www.cise.ufl.edu/research/sparse/matrices/Gset, 2003.
  • [62] Alp Yurtsever, Olivier Fercoq, and Volkan Cevher. A conditional-gradient-based augmented Lagrangian framework. In International Conference on Machine Learning, pages 7272–7281. PMLR, 2019.
  • [63] Alp Yurtsever, Ya-Ping Hsieh, and Volkan Cevher. Scalable convex methods for phase retrieval. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 381–384. IEEE, 2015.
  • [64] Alp Yurtsever, Joel A Tropp, Olivier Fercoq, Madeleine Udell, and Volkan Cevher. Scalable semidefinite programming. SIAM Journal on Mathematics of Data Science, 3(1):171–200, 2021.
  • [65] Xin-Yuan Zhao, Defeng Sun, and Kim-Chuan Toh. A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM Journal on Optimization, 20(4):1737–1765, 2010.
  • [66] Yang Zheng, Giovanni Fantuzzi, Antonis Papachristodoulou, Paul Goulart, and Andrew Wynn. Fast ADMM for semidefinite programs with chordal sparsity. In 2017 American Control Conference (ACC), pages 3335–3340. IEEE, 2017.