An Inexact Frank-Wolfe Algorithm for Composite Convex Optimization Involving a Self-Concordant Function
Abstract
In this paper, we consider Frank-Wolfe-based algorithms for composite convex optimization problems whose objective involves a logarithmically-homogeneous, self-concordant function. Recent Frank-Wolfe-based methods for this class of problems assume an oracle that returns exact solutions of a linearized subproblem. We relax this assumption and propose a variant of the Frank-Wolfe method with an inexact oracle for this class of problems. We show that our inexact variant enjoys convergence guarantees similar to the exact case, while allowing considerably more flexibility in approximately solving the linearized subproblem. In particular, our approach can be applied if the subproblem can be solved to prespecified additive error or to prespecified relative error (even though the optimal value of the subproblem may not be uniformly bounded). Furthermore, our approach can also handle the situation where the subproblem is solved via a randomized algorithm that fails with positive probability. Our inexact oracle model is motivated by certain large-scale semidefinite programs where the subproblem reduces to computing an extreme eigenvalue-eigenvector pair, and we demonstrate the practical performance of our algorithm with numerical experiments on problems of this form.
1 Introduction
Consider a composite optimization problem
(SC-OPT)
where is a linear map, is a convex, -logarithmically-homogeneous, -self-concordant barrier for a proper cone , and is a proper, closed, convex and possibly nonsmooth function with compact domain. We further assume that , ensuring that there exists at least one feasible point for (SC-OPT). Self-concordant functions are thrice differentiable convex functions whose third directional derivatives are bounded by their second directional derivatives (see Section 2 for the exact definition and key properties of these functions). Self-concordant functions appear in several applications such as machine learning, and image processing. Several second-order [28, 26, 27, 15] as well as first-order [13, 6, 7, 36] methods have been used to solve (SC-OPT) (see Section 1.4 for more details).
For large-scale applications of (SC-OPT), one natural algorithmic approach is to use the Frank-Wolfe algorithm [8], which proceeds by partially linearizing the objective at the current iterate , solving the linearized problem
(Lin-OPT)
and taking the next iterate to be a convex combination of and an optimal point for (Lin-OPT). It is challenging to analyse the convergence of such first-order methods for self-concordant functions because their gradients are generally unbounded on their domains, and so are not globally Lipschitz continuous. Despite this, when is -logarithmically homogeneous and self-concordant, Zhao and Freund [36] established convergence guarantees for the Frank-Wolfe method when the step direction and step size are carefully chosen in a way that depends on the optimal solution of (Lin-OPT) (see Algorithm 1). This limits the applicability of the method in situations where it may not be possible to solve (Lin-OPT) to high accuracy under reasonable time and storage constraints.
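To fix ideas, the classical Frank-Wolfe template described above can be sketched as follows. This is an illustrative sketch only: the toy objective, the simplex feasible set, and the classical 2/(k+2) step are our own choices for illustration, whereas the paper's Algorithm 1 ties the step length to the duality gap.

```python
import numpy as np

def frank_wolfe(grad_F, lmo, x0, iters=2000):
    """Generic Frank-Wolfe loop: linearize the objective at x, call the
    linear minimization oracle to solve the linearized problem (Lin-OPT),
    and take a convex combination of x and the oracle's answer."""
    x = x0
    gap = np.inf
    for k in range(iters):
        g = grad_F(x)
        v = lmo(g)                    # exact solution of the linearized problem
        gap = float(g @ (x - v))      # Frank-Wolfe duality gap at x
        x = x + 2.0 / (k + 2.0) * (v - x)   # classical step, illustration only
    return x, gap

# Toy instance: minimize (1/2)||x - c||^2 over the probability simplex;
# the LMO over the simplex returns the vertex with the smallest gradient entry.
c = np.array([0.2, 0.5, 0.3])
x_star, gap = frank_wolfe(lambda x: x - c,
                          lambda g: np.eye(len(g))[np.argmin(g)],
                          np.ones(3) / 3)
```

Since every update is a convex combination of feasible points, the iterates stay feasible without any projection, which is the main appeal of the method at scale.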
In this paper, we extend the generalized Frank-Wolfe method of [36] to allow for inexact solutions to the linearized subproblem (Lin-OPT) at each iteration. Our inexact oracle model is flexible enough to allow for situations (a) where the subproblem is only solved to a specified relative error, and (b) where the method for solving the subproblem is randomized and fails with positive probability. Despite this flexible oracle model, we establish convergence guarantees that are comparable to those given by Zhao and Freund [36] for the method with exact oracle. While the high-level strategy of the analysis is broadly inspired by that of [36], our arguments are somewhat more involved.
Input: Problem (SC-OPT) where is a 2-self-concordant, -logarithmically-homogeneous barrier function
1.1 Questions addressed by this work
In this section, we discuss three ideas that further generalize the generalized Frank-Wolfe method [36], together with the questions that each of these modifications addresses.
Motivating example.
A key motivating example for the use of an inexact oracle in the probabilistic setting is the case where is the indicator function for the set
(1.1)
whose extreme points are the rank-1 PSD matrices. The problem of optimizing convex functions over the set features prominently in the study of memory-efficient techniques for semidefinite programming (see, e.g., [25, 34]). For such problems, generating an optimal solution to (Lin-OPT) is equivalent to determining the smallest eigenvalue-eigenvector pair of . For large-scale problems, it is typically not practical to compute such an eigenvalue-eigenvector pair to very high accuracy at each iteration. Furthermore, scalable extreme eigenvalue-eigenvector algorithms (such as the Lanczos method with random initialization [14]) usually only give relative error guarantees, and are only guaranteed to succeed with probability at least for some .
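A minimal stand-in for the randomized eigensolvers mentioned above is power iteration on a shifted matrix with a random start vector: like randomized Lanczos, it returns an approximate (not exact) smallest eigenpair and succeeds with high probability over the initialization (it can fail, e.g., if the random start is nearly orthogonal to the true eigenvector). The shift via a Gershgorin bound is our own simplification.

```python
import numpy as np

def approx_min_eigpair(M, iters=3000, rng=None):
    """Approximate smallest eigenpair of a symmetric matrix M via power
    iteration on sigma*I - M with a random start vector."""
    rng = np.random.default_rng(rng)
    n = M.shape[0]
    sigma = np.abs(M).sum(axis=1).max()   # Gershgorin bound: sigma*I - M is PSD
    v = rng.standard_normal(n)            # random initialization
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = sigma * v - M @ v             # apply (sigma*I - M)
        v /= np.linalg.norm(v)
    lam = float(v @ (M @ v))              # Rayleigh quotient approximates lambda_min(M)
    return lam, v
```

In practice one would use a Lanczos-type method (it only needs matrix-vector products with M, matching the memory-efficient setting), but the failure mode, random initialization, and relative-accuracy output are the same.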
Furthermore, for large-scale instances, it is essential to have a memory-efficient way to represent the decision variable. Recent memory-efficient techniques such as sketching [33, 34, 29, 30] or sampling [25] involve alternative low-memory representations of the decision variable and provide provable convergence guarantees for first-order methods that employ these low-memory representations. In Section 6, we briefly discuss how such memory-efficient techniques can be combined with our proposed method specifically when applied to a semidefinite relaxation of the problem of optimizing quadratic form over the sphere (see Section 1.2.1).
We now describe the setting for each question listed in this section.
Inexact Linear Minimization Oracle
The inexact minimization oracle we consider here does not need to generate an optimal solution to (Lin-OPT); that is, the problem need not be solved exactly for the algorithm to retain provable convergence guarantees. Before defining the oracle precisely, we look at how an inexact oracle changes the definition of quantities in Algorithm 1. We first define the approximate duality gap, , as
Here, is the value of the decision variable at iterate of our method, and is the approximate minimizer of (Lin-OPT). Note that, if is replaced by , an optimal solution to (Lin-OPT), then becomes equal to the Frank-Wolfe duality gap [13]. In Algorithm 1, is used to define the step length . Thus, in Algorithm 1, an exact oracle is needed to compute the update direction as well as the step length.
We consider a setting where we compute the update direction from an approximate minimizer, , of (Lin-OPT), and correspondingly define and the step length as functions of . For our iterative algorithm to be a descent method, we need . This gives the criteria that the output of the inexact oracle must satisfy. Furthermore, our analysis shows that if, at iterate , is large enough, we can ensure the algorithm makes progress even if the approximation error in (Lin-OPT) is not small. Thus, we define an inexact oracle ILMO as follows:
Definition 1 (ILMO).
ILMO is an oracle that takes , (gradient of the objective function at ), and a nonnegative quantity as input, and outputs a pair , where is the approximate duality gap, such that , and at least one of the following two conditions is satisfied:
-
C.1
(large gap condition) , where is the maximum variation in over its domain,
-
C.2
(-suboptimality condition) , where is an optimal solution to (Lin-OPT).
Note that condition C.2 (-suboptimality) is equivalent to stating that the output of ILMO, , is an -optimal solution to (Lin-OPT); thus, we do not need to solve (Lin-OPT) exactly. Furthermore, our analysis shows that when is lower bounded by , need not be a -optimal solution. Thus, the definition of ILMO enforces that needs to be -optimal only when , which can reduce the computational cost of solving (Lin-OPT). In Section 3, we propose an approximate Frank-Wolfe algorithm (Algorithm 2) that only requires access to ILMO, and we establish convergence guarantees comparable to those of Algorithm 1, which requires an exact oracle.
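The two-condition structure of ILMO can be sketched as a wrapper around any inner solver for (Lin-OPT). Everything here is an assumption-laden sketch: `inner_solve(grad, tol)` is a hypothetical subroutine returning a feasible point within additive error `tol` of the (Lin-OPT) optimum, the feasible set is taken to be the simplex-like case where the nonsmooth term is an indicator (so the approximate gap is just an inner product), and the exact large-gap threshold of condition C.1 (a function of theta and R in the paper) is simply passed in as a parameter.

```python
import numpy as np

def ilmo(x, grad, delta, inner_solve, large_gap_threshold):
    """Sketch of an inexact linear minimization oracle in the spirit of
    Definition 1.  First try a cheap, low-accuracy solve: if the observed
    approximate duality gap already clears the large-gap condition C.1,
    return immediately; otherwise re-solve to additive accuracy delta so
    that the delta-suboptimality condition C.2 holds."""
    v = inner_solve(grad, tol=large_gap_threshold)   # coarse first attempt
    G_hat = float(grad @ (x - v))                    # approximate duality gap
    if G_hat >= large_gap_threshold:                 # condition C.1 holds
        return v, G_hat
    v = inner_solve(grad, tol=delta)                 # refine: C.2 now holds
    return v, float(grad @ (x - v))
```

The point of the early return is exactly the computational saving discussed above: the expensive high-accuracy solve is only triggered when the observed gap is small.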
Inexact oracle with a nonzero probability of failure.
We consider an inexact oracle PLMO that has a nonzero probability of failure as defined below.
Definition 2 (PLMO).
The key difference between PLMO and ILMO (Definition 1) is the nonzero probability of failure of the oracle PLMO. Indeed, ILMO is a special case of PLMO when . In Section 5, we consider this probabilistic setting for the oracle. We provide convergence analysis of our proposed approximate generalized Frank-Wolfe algorithm, when the linear minimization oracle fails to satisfy both conditions C.1 (large gap) and C.2 (-suboptimality) with probability at most .
Implementing the inexact oracles
Our inexact oracle model is flexible enough that it can be implemented using an algorithm for solving (Lin-OPT) to either prespecified additive error or prespecified multiplicative error. The multiplicative error case is of interest when is the indicator function of a compact convex set so that for all feasible . In this case, it turns out (see Lemma 8) that the optimal value of (Lin-OPT), , is negative. By an algorithm that solves (Lin-OPT) to multiplicative error , we mean an algorithm that returns a feasible point that satisfies
This corresponds to an additive error of . The multiplicative error case is particularly challenging because the gradient of a self-concordant barrier, , grows without bound near the boundary of the domain of , meaning that is not uniformly bounded. Therefore, a small relative error for (Lin-OPT) (i.e., a small value of ) may still lead to large absolute errors. In Section 4 we show how the large gap condition (Condition C.1) in our oracle model allows us to overcome this issue.
1.2 Applications
Consider the -self-concordant, -logarithmically-homogeneous barrier function
(1.2)
with and , where the linear map , maps the decision variable to an -dimensional vector . In this section, we review two applications (see Subsections 1.2.1 and 1.2.2) whose convex formulations involve minimizing functions of the form (1.2) over (1.1). The problems of this type are also often studied in the context of memory-efficient algorithms. One of the contributions of this paper is to show that, for these applications, we can combine memory-efficient representations [25] of the decision variable with the first-order algorithm proposed in this paper.
Historically, self-concordant barrier functions were often studied because they allow for an affine-invariant analysis of Newton’s method [19]. Recently, problems with self-concordant barriers in the objective have also been studied in the context of machine learning, often as a loss function [2, 22, 16]. The function (1.2) appears in several applications such as maximum-likelihood quantum state tomography [11], portfolio optimization [31, 4, 35], and maximum-likelihood Poisson inverse problems [10, 20, 21] arising in imaging applications such as night vision and low-light imaging. Furthermore, convex formulations of problems such as -optimal design [5, 1] and learning sparse Gaussian Markov random fields [26, 23] also involve optimizing a 2-self-concordant, -logarithmically-homogeneous barrier function (, with ). The functions and are also commonly used self-concordant barriers for the PSD cone and , respectively.
1.2.1 Optimizing quadratic form over the sphere
Consider the problem
(GMean)
where for . Many applications, such as approximating the permanent of PSD matrices or portfolio optimization, take the form of maximizing quadratic forms over . By normalizing the objective of this problem, we get (GMean), whose objective is the geometric mean of the quadratic forms. Yuan and Parrilo [32, Theorem 7.1] show that when , (GMean) is NP-hard. A natural semidefinite relaxation of (GMean) is
(1.3)
Yuan and Parrilo [32] show that solving the SDP relaxation (1.3) and appropriately rounding the solution to have rank one, gives a constant factor approximation algorithm for (GMean). For given in (1.1), we have an equivalent problem
(SC-GMean)
where the objective is a 2-self-concordant, -logarithmically-homogeneous barrier function, and (SC-GMean) is equivalent to (1.3) up to the scaling of the objective. So, an optimal solution of (SC-GMean) is also an optimal solution of (1.3). Yuan and Parrilo [32, Theorem 1.3] show that if is optimal for (1.3) and , then is feasible for (GMean) and within a factor of of the optimal value, where is the Euler-Mascheroni constant and is the digamma function.
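The objective and gradient used by a Frank-Wolfe method on (SC-GMean) are cheap to evaluate. The sketch below assumes the formulation min over the spectraplex of the negative sum of logs of the inner products with the data matrices (the exact formulas are stripped from this copy, so both the objective and the gradient expression here are our reconstruction of that assumed form):

```python
import numpy as np

def scgmean_value_and_grad(X, mats):
    """Objective and gradient for the assumed (SC-GMean) formulation
        F(X) = -sum_i log <A_i, X>   over  {X PSD, trace(X) = 1},
    whose barrier is 2-self-concordant and logarithmically homogeneous
    with parameter m = len(mats).  The gradient is
        grad F(X) = -sum_i A_i / <A_i, X>."""
    vals = np.array([float(np.sum(A * X)) for A in mats])  # <A_i, X>
    F = -float(np.sum(np.log(vals)))
    G = -sum(A / v for A, v in zip(mats, vals))
    return F, G
```

Note that the linearized subproblem for this objective is exactly the minimum-eigenpair computation discussed in Section 1.1, since the gradient is a symmetric matrix and the feasible set is the spectraplex.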
1.2.2 Maximum-likelihood Quantum State Tomography (ML-QST)
A problem with structure similar to (SC-GMean) also appears in Quantum State Tomography (QST). QST is the process of estimating an unknown quantum state from given measurement data. We model a measurement as a random variable that takes values in the set with probability distribution , where each is a PSD complex matrix such that , , where is defined in (1.1). Note that, if there are qubits, then . Maximum-likelihood estimation of quantum states [11, 12] involves finding the state (represented by ) that maximizes the probability of observing the measurement. Assume that we have measurements where for . The maximum-likelihood QST problem then takes the form , where for . Several results on maximum-likelihood estimation for QST are available in the literature ([9, 3, 24]). The problem can be equivalently stated as
(ML-QST)
Note that (ML-QST) is identical to (SC-GMean) and its objective is a 2-self-concordant, -logarithmically-homogeneous barrier.
1.3 Contributions
The focus of this paper is to provide a framework to solve a minimization problem of the form (SC-OPT), where the objective function is a self-concordant, logarithmically-homogeneous barrier function. We provide an algorithmic framework that addresses the three questions stated in Section 1.1. The main contributions of the paper are summarized below.
-
•
We provide an algorithmic framework that generates an -optimal solution to (SC-OPT). The proposed algorithm (Algorithm 2) only requires access to an inexact oracle ILMO (Definition 1). We provide provable convergence guarantees (Lemmas 5 and 6) that are comparable to the guarantees given by Zhao and Freund [36], whose method requires access to an exact oracle.
-
•
We consider the setting where the inexact oracle can provide an approximate solution to (Lin-OPT) with either additive or multiplicative error bound. We show how to implement the oracle that satisfies the output criteria of ILMO. In this scenario, the convergence guarantees given by Lemmas 5 and 6 for Algorithm 2 still hold true.
-
•
We next consider a setting where the linear minimization oracle fails to satisfy both conditions C.1 (large gap) and C.2 (-suboptimality) with probability at most . In this setting, we show that we can bound, with probability at least , the number of iterations of Algorithm 2 required to converge to an -optimal solution of (SC-OPT) (see Lemmas 12 and 13).
-
•
We apply our algorithmic framework specifically to (SC-GMean) and provide the computational complexity of our proposed method. We also show that our method can be combined with a memory-efficient technique given by Shinde et al. [25], so that the method uses at most memory (apart from the memory used to store the input parameters). We also implement our method as well as Algorithm 1, and use these algorithms to generate an -optimal solution to randomly generated instances of (SC-GMean). We show that, for these instances, the behavior of our method (in terms of the change in optimality gap with each iteration) is similar to that of Algorithm 1. We also show that the theoretical upper bound on the number of iterations needed to reach an -optimal solution is quite conservative; in practice, the algorithms generate the desired solution much more quickly.
1.4 Brief Review of First- and Second-Order Methods for (SC-OPT)
Interior-point methods [18, 17] can be used to solve composite problems of the form (SC-OPT). A standard technique when applying IPMs to such problems is to first eliminate the nonsmooth term and introduce (self-concordant) barrier functions for the inequality constraints. The method then solves a sequence of problems to generate an approximate solution to (SC-OPT). Such methods often result in the loss of underlying structure (for example, sparsity) in the input problem. To overcome this issue, proximal gradient-based methods [28, 26, 27] directly solve the nonsmooth problem (SC-OPT) and provide local convergence guarantees (, where ) for (SC-OPT). Liu et al. [15] use a projected Newton method to solve , where is self-concordant, such that the projected Newton direction is computed in each iteration using an adaptive Frank-Wolfe method for the subproblem. They show that the outer loop (projected Newton) and the inner loop (Frank-Wolfe) have complexity and , respectively, where , for some constants , where satisfy the conditions outlined in [15, Theorem 4.2].
For large-scale applications of (SC-OPT), Frank-Wolfe and other first-order methods are often preferred since, at each iteration, an exact or approximate solution is generated to the linear minimization problem (Lin-OPT), which can lead to low computational complexity per iteration. However, the analysis of such methods usually depends on the assumption that has a Lipschitz continuous gradient [13], which a self-concordant function does not satisfy. There are algorithms (for example, [6]) that provide new policies to compute the step length at each iteration of Frank-Wolfe and provide convergence guarantees for the modified algorithm. However, these guarantees depend on the Lipschitz constant of the objective on a sublevel set, which might not be easy to compute and in some cases may not even be bounded.
1.5 Notations and Outline
Notations.
We use lowercase letters (for instance, ) to denote vector variables, and uppercase letters (for example, ) to denote matrix variables. We use superscripts to denote the values of parameters and variables at the -th iteration of the algorithm. The term denotes the matrix inner product. The notations and are used to denote the gradient and the Hessian, respectively, of a twice differentiable function . The linear map maps a symmetric matrix to a -dimensional space and is its adjoint. The notation has the usual complexity interpretation.
Outline.
In Section 2, we review the generalized Frank-Wolfe algorithm and key properties of a 2-self-concordant, -logarithmically-homogeneous function. We also define the terminology used in the rest of the paper. In Section 3, we propose an approximate generalized Frank-Wolfe algorithm (Algorithm 2) that generates an -optimal solution to (SC-OPT), and we provide convergence analysis of the algorithm. In Section 4, we show how an oracle that generates an approximate solution with relative error can be used with our proposed method (Algorithm 2). In Section 5, we propose a method (Algorithm 2) that, with high probability, generates an -optimal solution to (SC-OPT) when the oracle that solves the linear subproblem has a nonzero probability of failure. In Section 6, we apply our method specifically to (SC-GMean) and provide the convergence analysis of the method. We also show that our method can be combined with memory-efficient techniques from the literature so that it can be implemented in memory. Finally, in Section 7, we provide computational results of applying Algorithm 2 to random instances of (SC-GMean).
2 Preliminaries
In this section, we review key properties of self-concordant, -logarithmically-homogeneous barrier functions as well as other quantities that are useful in our analysis. For a detailed review of self-concordant functions, refer to Nesterov and Nemirovskii [19].
Definition 3.
[19] Let be a regular cone, and let be a thrice-continuously differentiable, strictly convex function defined on . Define the third-order directional derivative . Then is a -self-concordant, -logarithmically-homogeneous barrier on if
-
1.
along every sequence that converges to a boundary point of ,
-
2.
for all and , and
-
3.
for all and .
Properties of a -self-concordant, -logarithmically-homogeneous barrier function .
Let be a proper cone and let be a -self-concordant, -logarithmically-homogeneous barrier function on . For any , let denote the Hessian of at . Define the (semi)norm for any . Also, define which is often referred to as the unit Dikin ball. Define a univariate function and its Fenchel conjugate as follows:
(2.1)
(2.2)
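The definitions in (2.1)-(2.2) are stripped from this copy; in the standard self-concordance literature [19] the univariate function and its Fenchel conjugate are omega(s) = s - ln(1 + s) and omega*(s) = -s - ln(1 - s), which we assume here. A quick numerical check of the conjugacy relation (the supremum of t*s - omega(s) over s is attained at s = t/(1 - t)):

```python
import numpy as np

def omega(s):
    """omega(s) = s - ln(1 + s), defined for s > -1 (assumed form of (2.1))."""
    return s - np.log1p(s)

def omega_star(s):
    """omega*(s) = -s - ln(1 - s), defined for s < 1 (assumed form of (2.2)),
    the Fenchel conjugate of omega."""
    return -s - np.log1p(-s)
```

Both functions are nonnegative, vanish at 0, and blow up as s approaches the boundary of their domains, which is what makes them the natural upper/lower envelopes in self-concordance estimates.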
Proposition 1 (Proposition 2.3.4, Corollary 2.3.1,2.3.3 [19]).
Let be a proper cone and let be a -self-concordant, -logarithmically-homogeneous barrier function on . If then
(2.3)
(2.4)
(2.5)
(2.6)
(2.7)
(2.8)
These properties will be used later in the proofs.
2.1 Terminology
We now define a few quantities that are used in the rest of the paper.
-
•
Approximate Frank-Wolfe duality gap, : For any two feasible points , to (SC-OPT), we have
(2.10)
Note: The duality gap, , is a function of the current iterate , whereas the approximate duality gap, , is a function of the current iterate and any feasible point to (SC-OPT). However, for ease of notation, we denote by , and by , in the rest of the paper when and are clear from the context.
-
•
Maximum variation of over its domain, : We define as
(2.12)
-
•
Distance, : For any two feasible points and to (SC-OPT), we define
(2.13)
where is the (semi)norm defined by the Hessian of at . When and are clear from the context, we denote by .
The following standard result shows that the optimality gap is bounded by the duality gap.
Fact 1.
For any feasible point to (SC-OPT), we have .
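Fact 1 can be verified numerically on a toy instance (the objective, the simplex feasible set, and the indicator choice for the nonsmooth term here are purely illustrative): for a smooth convex function over the probability simplex, the optimality gap at any feasible point is dominated by the Frank-Wolfe duality gap.

```python
import numpy as np

# Toy check of Fact 1: for f(x) = 0.5*||x - c||^2 over the probability
# simplex with c feasible (so the optimum is x* = c), the optimality gap
# f(x) - f(x*) is bounded by the duality gap
#   G(x) = <grad f(x), x> - min_i grad f(x)_i,
# which is the maximum of <grad f(x), x - v> over feasible v.
rng = np.random.default_rng(0)
c = np.array([0.2, 0.5, 0.3])               # optimum of the toy problem
for _ in range(100):
    x = rng.dirichlet(np.ones(3))           # random feasible point
    g = x - c                               # gradient of f at x
    opt_gap = 0.5 * float(np.dot(x - c, x - c))   # f(x) - f(x*)
    fw_gap = float(g @ x - g.min())               # duality gap at x
    assert opt_gap <= fw_gap + 1e-12
```

The inequality is just convexity: f(x) - f(x*) <= <grad f(x), x - x*>, and the right-hand side is at most the maximum over all feasible points, which is the duality gap.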
3 Approximate Generalized Frank-Wolfe algorithm
Consider the composite optimization problem (SC-OPT), where is a -logarithmically-homogeneous, -self-concordant barrier function on , is a proper cone, and is a proper, closed, convex and possibly nonsmooth function with a closed, compact, nonempty set. Zhao and Freund [36] propose a Generalized Frank-Wolfe algorithm (Algorithm 1) that requires an exact solution to the subproblem (Lin-OPT) at each iteration. In their algorithm, the knowledge of the exact minimizer is necessary to compute the update step, , as well as the step length (since the step length depends on the duality gap ). Furthermore, the analysis of the algorithm also depends on the exact minimizer of the subproblem (Lin-OPT). We, however, propose an ‘approximate’ generalized Frank-Wolfe algorithm. The term approximate in the name of our algorithm comes from the fact that we do not require the linear minimization subproblem (Lin-OPT) to be solved exactly at each iteration. The framework of our method is given in Algorithm 2. We provide a detailed explanation of Steps 5 and 6 (setting the value of and ILMO respectively) below.
Input: Problem (SC-OPT), where is a 2-self-concordant, -logarithmically-homogeneous barrier function, : suboptimality parameter, scheduled strategy S.1 or adaptive strategy S.2 for setting the value of
Output: , an -optimal solution to (SC-OPT)
Note that any -self-concordant function can be converted to a -self-concordant function by scaling the objective. We state this property formally in the following proposition.
Proposition 2.
Let be an -self concordant, -logarithmically homogeneous function. Define a function and parameter such that
-
1.
,
-
2.
.
Then is a -self concordant, -logarithmically homogeneous function.
Proposition 2 tells us that, by rescaling, we can always reduce to the regime of functions that are 2-self-concordant (often called standard self-concordant). However, if we are minimizing such a function, rescaling affects the optimal value, and hence affects approximation-error guarantees. Thus, even though the input to Algorithm 2 is specifically a problem with a 2-self-concordant barrier function in the objective, by replacing by in Step 7 of the algorithm, we can generate an -optimal solution to (SC-OPT) when is an -self-concordant, -logarithmically-homogeneous barrier function.
Choosing .
We propose two different strategies to choose the value of in Step 5 of the algorithm.
-
S.1
(Scheduled strategy) is any positive sequence converging to zero. It follows that, given any , there exists some positive integer such that for all .
-
S.2
(Adaptive strategy) .
Remark 1.
The main idea behind the adaptive strategy S.2 is to allow the target accuracy of ILMO to adapt to the progress of the algorithm (in terms of the smallest approximate duality gap encountered so far). This allows ILMO to solve (Lin-OPT) to low accuracy until significant progress has been made in terms of the approximate duality gap. Since we are working with the approximate duality gap, and not the actual duality gap, it is possible that before the algorithm has reached the desired optimality gap. We use the additive quantity to ensure that we can choose even when for some .
ILMO in Algorithm 2.
The main difference between the generalized Frank-Wolfe method [36] and Algorithm 2 is the definition of the oracle. The generalized Frank-Wolfe algorithm (Algorithm 1) requires LMO to generate an optimal solution to the subproblem (Lin-OPT) in order to compute the update direction and step length, as well as to prove its convergence guarantees. In Algorithm 2, by contrast, ILMO is an oracle whose output only needs to satisfy and at least one of the two conditions C.1 (large gap) and C.2 (-suboptimality) (see Definition 1).
When condition C.2 (-suboptimality) is satisfied by the output of ILMO, we have the following relationship between the duality gap and the approximate duality gap .
Lemma 1 (Relationship between and ).
Proof.
If condition C.2 (-suboptimality) holds, then Algorithm 2 generates such that
(3.2)
From the definition of (see (2.10)), we have
where the second equality follows from the second inequality in (3.2). We also have
where the inequality follows from the first inequality in (3.2). Combining the two results completes the proof. ∎
Computing .
We now explain how the step-size, , is selected in Algorithm 2. Substituting and (as given by Algorithm 2) in (2.3), we have
(3.3)
Furthermore, since is convex,
(3.4)
Now, adding (3.3) and (3.4), we have
(3.5)
where the last inequality follows from (2.10). We can now compute an optimal step length at every iteration by optimizing the right hand side of (3.5) for to get
(3.6)
Zhao and Freund [36] use a similar technique to compute based on . The gap requires the knowledge of an optimal solution to the subproblem (Lin-OPT), whereas (3.6) uses an approximate solution to (Lin-OPT).
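To make the step-size computation concrete: under the standard self-concordance machinery, the right-hand side of (3.5) has the form -gamma*G + omega*(gamma*D), where G is the approximate duality gap, D is the distance (2.13), and omega*(s) = -s - ln(1 - s). Setting the derivative in gamma to zero gives gamma* = G/(D*(G + D)), which we take to be the closed form behind (3.6) (the formula in this copy is stripped, so this reconstruction is an assumption); note gamma* * D = G/(G + D) < 1, so the bound remains finite.

```python
import numpy as np

def omega_star(s):
    """omega*(s) = -s - ln(1 - s), defined for s < 1."""
    return -s - np.log1p(-s)

def step_length(G_hat, D_hat):
    """Minimizer of the upper bound  -gamma*G + omega*(gamma*D)  over gamma:
    the stationarity condition  G = D * (gamma*D) / (1 - gamma*D)  solves to
        gamma* = G / (D * (G + D)),
    with gamma* * D = G / (G + D) < 1 automatically.  We also cap gamma at 1
    so that the next iterate stays a convex combination."""
    return min(1.0, G_hat / (D_hat * (G_hat + D_hat)))
```

A grid search over gamma confirms that this closed form attains the minimum of the bound, which is how the descent estimates in Proposition 3 are obtained.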
Behavior of .
In the following proposition, we give lower bounds on the amount by which decreases at each iteration of Algorithm 2. Note that this result is independent of the sequence .
Proposition 3.
Let (SC-OPT) (with a 2-self-concordant and -logarithmically-homogeneous barrier function) be the input to Algorithm 2. Let the quantities and be as defined in (2.12) and (2.10) respectively. For any in Algorithm 2,
-
1.
if , we have
(3.7)
-
2.
if , we have
(3.8)
Consequently, is a monotonically non-increasing sequence.
Proof.
The proof of this result is similar to the proof given in [36]. The difference arises due to the slightly different value of the step length that we have used. We give the proof here for completeness.
Proving (3.7).
Rewriting (2.10), we have
(3.9)
since for all . Thus,
(3.10)
Now, using the definition of and (2.4), we can write
(3.11)
Combining (3.10) and (3.11), we bound as
(3.12)
where (i) uses the definition of (2.12), (ii) uses the fact that and (iii) uses (2.8). Now, since , it follows that , and so, substituting in (3.5) and rearranging gives
(3.13)
The relationship between the optimality gap at consecutive iterations is given as
(3.14)
where (i) follows from (A.1) and the monotonicity of . Since , we have . Now, [36, Proposition 2.1] states that for all , and so,
(3.15)
where the second inequality follows since , thus proving (3.7).
Proving (3.8).
First, consider the case where , which indicates that , which can be rearranged as
(3.16)
From (3.5), we see that
(3.17)
Combining (3.16), (3.17), and the fact that [36, Proposition 2.2], we have
(3.18)
Furthermore,
(3.19)
where (i) uses the fact that , and (ii) uses the inequality . This completes the proof of (3.8) for the case . Now, if , then , and from (3.13), we have
(3.20)
where (i) uses (A.1) and the monotonicity of , (ii) uses the fact that and [36, Proposition 2.1], and (iii) uses the fact that . This completes the proof of (3.8).
Proposition 4.
If and Algorithm 2 stops (i.e., it reaches Step 8) at iteration , we have .
3.1 Convergence Analysis of Algorithm 2
Before providing the convergence analysis of Algorithm 2, we introduce a few key concepts and relationships that are used in the analysis.
3.1.1 Partitioning the iterates of Algorithm 2
Our analysis depends on decomposing the sequence of iterates generated by Algorithm 2 into three subsequences depending on the value of . This partitioning depends on a parameter , which controls which iterates fall into which subsequence, as well as the logarithmic homogeneity parameter and the maximum variation parameter . Note that the definition makes sense for any pair of sequences and of points feasible for (SC-OPT), irrespective of how they are generated.
Definition 4.
Let and be any sequence of feasible points for (SC-OPT) such that . Given , define three subsequences of as follows:
-
•
The -subsequence consists of those for which .
-
•
The -subsequence consists of those for which and .
-
•
The -subsequence consists of those for which and .
Remark 2.
The partitioning of the iterates as in Definition 4 refines the partitioning used in [36] which considers the situation where the linearized subproblems are solved exactly at each iteration. In that situation, for all and the partitioning in Definition 4 (with ) only yields the - and -subsequences. One key idea in our analysis is to distinguish the iterates where behaves like by providing an upper bound on (the -subsequence), and the iterates where does not give such an upper bound on the optimality gap (the -subsequence). For the -subsequence, our analysis broadly follows the approach in [36], showing progress in both the optimality gap and the approximate duality gap. For the -subsequence, we can still ensure progress in terms of the approximate duality gap. If the algorithm is run for sufficiently many iterations, either the -subsequence, or the -subsequence, will be long enough to ensure the stopping criterion is reached.
Given a positive integer , define
(3.21)
(3.22)
(3.23)
Note that , since every iterate falls into exactly one of these three sequences.
When the iterates and are generated by Algorithm 2, then the optimality gap and approximate duality gap for different subsequences satisfy certain inequalities. These allow us to analyse the progress of the algorithm in different ways along the different subsequences. The following result summarizes the key inequalities for each subsequence that will be used in our later analysis.
Lemma 2.
3.1.2 Bounding the length of the -subsequence
We now provide an upper bound on the length of the -subsequence in terms of the initial optimality gap .
Lemma 3 (Bound on ).
3.1.3 Controlling the optimality gap and approximate duality gap along the -subsequence
In this section, we will show that in any sufficiently large interval of the -subsequence, there must be an iterate at which the approximate duality gap is small. To do this, we use the following result, a slight modification of [36, Proposition 2.4]. While [36, Proposition 2.4] bounds the minimum over the first elements in the sequence , i.e., , Proposition 5 provides a bound on the minimum over any subsequence of consecutive elements with .
Proposition 5.
Let . Consider two nonnegative sequences and that satisfy
• for all and
• for all .
If , then and
(3.27) |
If, in addition, , then and
(3.28) |
Proof.
First, suppose that for some . In this case for all and for all , so the bounds hold trivially. As such, assume that is chosen so that , and hence for .
The first recurrence implies that for all . Dividing both sides of the first recurrence by the positive quantity gives
where the inequalities follow from the assumption that and the fact that for all .
It follows that and so . Since this inequality is independent of , we can minimize over to obtain the best bound. By setting and using , we obtain for all . Furthermore, if , we have that .
Now we establish the bound on . Let . Since the minimum is bounded above by the average, we have that
Using the bound (valid since ), we have that
(3.29) |
We are free to choose to minimize the bound on the right hand side of (3.29). Let . A minimum occurs at if , otherwise it occurs at . In the first case, we obtain
In the second case we obtain
Taking the square root of both sides gives the desired bound. ∎
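For intuition, recurrences of this general shape yield O(1/k) decay, and the min-versus-average step used above is elementary. The following toy check assumes the model a_{k+1} = a_k − γ·a_k² with 0 < a_0 ≤ 1/γ, an illustrative stand-in for the hypotheses of Proposition 5 rather than the paper's exact recurrence, and verifies numerically that a_k ≤ 1/(γk).

```python
# Assumed model (illustrative, not the paper's exact recurrence):
# a_{k+1} = a_k - gamma * a_k**2 with 0 < a_0 <= 1/gamma implies
# a_k <= 1/(gamma * k), the classical O(1/k) Frank-Wolfe-type decay.
gamma = 0.5
a = 1.0
seq = [a]
for k in range(1, 200):
    a = a - gamma * a * a
    seq.append(a)
    assert 0.0 <= a <= 1.0 / (gamma * k) + 1e-12

# The key elementary step in the proof above: a minimum over a window of
# consecutive elements is bounded by the window's average.
window = seq[50:100]
assert min(window) <= sum(window) / len(window)
```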
The following lemma shows how the optimality gap and the approximate duality gap reduce for iterates in the -subsequence. The implication of this result is that if the -subsequence is long enough, both of these quantities will eventually become small.
Lemma 4.
3.2 Analysis of Algorithm 2
In this section, we provide a bound on the number of iterations of Algorithm 2 required to achieve the stopping criterion and return an -optimal solution (i) when is chosen according to scheduled strategy S.1 (see Lemma 5), and (ii) when is chosen according to adaptive strategy S.2 (see Lemma 6).
Lemma 5.
Proof.
Let and let the first iterations of Algorithm 2 be partitioned into the -, -, and -subsequences according to Definition 4 with . Then .
From Lemma 3, we know that , and so it follows that . We now consider two cases: (i) , and (ii) . In the first case, the -subsequence will be long enough to establish convergence. In the second case, the -subsequence will be long enough to ensure convergence.
Case 1: . In this case we will show that it suffices to choose where
Since , it follows that , as required. We now establish bounds on , and .
Case 2: . In this case we will show that it suffices to choose .
Since , we have that . Consider the sequence for . Since , from Lemma 2 we have that for each .
Since , we have (from the definition of ). We also have . Finally, from Proposition 4 we have that , completing the argument. ∎
The following result (Lemma 6) provides an upper bound on the number of iterations of Algorithm 2 until it stops when the value of is set according to adaptive strategy S.2. The main difference between the proofs of Lemma 5 and Lemma 6 is in the analysis of the -subsequence. In Lemma 6, since is chosen in terms of the smallest value of the approximate duality gap we have encountered so far, we can ensure that the approximate duality gap reduces geometrically along the -subsequence.
Lemma 6.
Proof.
Let and let the first iterations of Algorithm 2 be partitioned into the -, -, and -subsequences according to Definition 4 with . Then . We know that (from Lemma 3). It follows that .
We now consider two cases: (i) , (ii) .
Case 1: . In this case we will show that it suffices to choose where
Since , we have . It follows that , as required. We now establish bounds on and .
Case 2: . In this case, we will show that it suffices to choose .
Consider the sequence for . Since , from Lemma 2 we have that for . From our choice of , we know that for ,
Since and, for each we have , it follows that
where the second inequality holds because for all . Recall that and that by assumption. Then
by the definition of . Moreover
Finally, from Proposition 4 we have that , as required. ∎
4 Implementing ILMO
In this section we discuss how to implement the oracle ILMO given that we have access to a method to approximately solve (Lin-OPT) with either prescribed additive error or (under additional assumptions) prescribed relative error.
If we have a method to solve (Lin-OPT) to a prescribed additive error, then we can use it to simply guarantee that condition C.2 (-suboptimality) is satisfied at each iteration. However, this does not automatically guarantee that the associated near-minimizer for (Lin-OPT) satisfies the additional condition . In Section 4.1, we briefly discuss how to ensure that this condition is also satisfied.
The oracle ILMO, with its two possible exit conditions ( C.1 (large gap) and C.2 (-suboptimality)), allows more implementation flexibility than just solving (Lin-OPT) to a prescribed additive error. In Section 4.2, we restrict to problems for which is the indicator function of a compact convex set. In this case, we will show that if we only have a method to solve (Lin-OPT) to prescribed relative error, it can still be used to implement ILMO. This is less straightforward since, in general, the optimal value of (Lin-OPT) is not uniformly bounded. Moreover, this is exactly the scenario of interest in the motivating example discussed in Section 1.1.
4.1 Ensuring the output of ILMO satisfies
The definition of ILMO requires that the point produced by ILMO is such that is nonnegative. This is clearly satisfied whenever ILMO returns a point such that . However, when the prescribed additive error for (Lin-OPT) is small, it could be the case that condition C.2 (-suboptimality) is satisfied but . In this case, it turns out that we can simply return as a valid output of ILMO.
Lemma 7.
Proof.
We note that if ILMO returns then either Algorithm 2 terminates at line 7 or we will have and , so that . ∎
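A minimal sketch of this fallback, with illustrative names (`approx_min`, `gap`) that are assumptions rather than the paper's definitions: the wrapper returns the near-minimizer when its computed gap is nonnegative, and otherwise returns the current iterate, whose gap is zero by construction.

```python
# Hedged sketch of the Section 4.1 fallback.  `approx_min` and `gap` are
# hypothetical stand-ins for the subproblem solver and the approximate
# duality gap, not the paper's exact definitions.

def safe_ilmo(x, approx_min, gap):
    """Return a point h certified with gap(x, h) >= 0, else fall back to x."""
    h = approx_min(x)
    if gap(x, h) >= 0.0:
        return h
    return x  # gap(x, x) = 0, so the current iterate is a valid output

# Toy model: linearize f(x) = x^2 over the interval [-1, 1].
grad = lambda x: 2.0 * x
gap_fn = lambda x, h: grad(x) * (x - h)  # <grad f(x), x - h>
good_oracle = lambda x: -1.0 if grad(x) > 0 else 1.0
bad_oracle = lambda x: 2.0  # returns a poor point with negative gap

assert safe_ilmo(0.5, good_oracle, gap_fn) == -1.0  # nonnegative gap kept
assert safe_ilmo(0.5, bad_oracle, gap_fn) == 0.5    # fallback to iterate
```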
4.2 ILMO using an approximate minimizer of with prescribed relative error
In this section, we assume that we have access to a method that can compute an approximate minimizer of with prescribed relative error. We further assume that is an indicator function, so that . This assumption ensures that the optimal value of (Lin-OPT) is related to , the parameter of logarithmic homogeneity of .
Lemma 8.
Proof.
The following result shows that a method to solve (Lin-OPT) with a multiplicative error guarantee is enough to implement ILMO in Algorithm 2. Crucially, the prescribed multiplicative error required is uniformly bounded, independent of the current iterate .
Lemma 9.
Proof.
From Lemma 8 we know that . We prove the result by considering two cases, depending on whether is large or small.
Case I: .
Case II:
5 Probabilistic LMO
Algorithm 2 uses an oracle ILMO that generates a pair such that the output satisfies at least one of the two conditions C.1 (large gap) and C.2 (-suboptimality). In this section, we consider a setting where we only have access to a probabilistic oracle PLMO, that fails with probability at most to generate the output as required by the algorithm (see Definition 2). We first provide an algorithm that uses PLMO to generate an -optimal solution to (SC-OPT) with any desired probability of success. We then analyze the convergence of the algorithm for both strategies S.1 (scheduled) and S.2 (adaptive) of choosing the sequence .
Framework of the algorithm.
Algorithm 3 provides the framework of the algorithm that uses PLMO. There are two main differences between Algorithm 2 and Algorithm 3. The first is that we use the probabilistic oracle PLMO (see Definition 2) in Step 6 instead of ILMO, i.e., with probability at most , the pair generated in Step 6 of Algorithm 3 satisfies neither condition C.1 (large gap) nor condition C.2 (-suboptimality), but does satisfy . The second difference is in the stopping criterion of the algorithm. While Algorithm 2 stops when and , Algorithm 3 stops when the bounds and are satisfied times. We will see (in Proposition 6) that the parameter controls the probability that the optimality gap is small when the algorithm stops.
Input: Problem (SC-OPT), where is a 2-self-concordant, -logarithmically-homogeneous barrier function, : suboptimality parameter, scheduled strategy S.1 or adaptive strategy S.2 for setting the value of
Output: : an -optimal solution to (SC-OPT) with probability at least
It is sometimes fruitful to think of the iterates of Algorithm 3 as being iterates of Algorithm 2 with respect to an alternative (random) sequence . To see how this is the case, let be a non-negative sequence and let be a sequence of independent random variables where with probability and with probability and for all . Then we can think of the iterates of Algorithm 3 with respect to as the iterates of Algorithm 2 with respect to the random sequence where for all . The multiplier allows ILMO to model the behaviour of PLMO. Indeed when the multiplier , condition C.2 (-suboptimality) holds trivially with respect to , effectively enforcing no additional constraint on the output of the oracle, apart from .
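This viewpoint can be sketched as a small simulation. The names `ilmo` and `fallback` below are illustrative assumptions: a failed call returns only a point with a nonnegative gap and no accuracy guarantee, while a successful call behaves like the deterministic oracle.

```python
import random

def plmo(x, ilmo, fallback, q, rng):
    """Probabilistic oracle model: with probability 1-q behave like the
    (assumed) deterministic oracle `ilmo`; with probability q 'fail' and
    return only a point with a nonnegative gap.  Names are illustrative."""
    if rng.random() < q:
        return fallback(x), False  # failed call: no accuracy guarantee
    return ilmo(x), True

rng = random.Random(1)
q = 0.3
results = [plmo(0.0, lambda x: -1.0, lambda x: 0.0, q, rng)
           for _ in range(2000)]
fail_rate = sum(1 for _, ok in results if not ok) / len(results)
assert abs(fail_rate - q) < 0.05  # empirical failure rate is close to q
```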
This viewpoint allows us to leverage existing results for the iterates of Algorithm 2 in the context of Algorithm 3. In particular, any results that are independent of the sequence , such as Lemmas 3 and 4, continue to hold for Algorithm 3. Other results that depend on (such as the bound (3.26) in Lemma 2) remain valid as long as we replace with the random variable . In light of this observation, the following result summarizes the key inequalities that hold for the iterates of Algorithm 3.
Lemma 10.
Let and be sequences of feasible points for (SC-OPT) generated by Algorithm 3. Partition the iterates into the -, -, and - subsequences with respect to some as in Definition 4. Then
•
• If then .
• If then .
• The random variables (indexed by ) are independent, and are each non-negative with probability at least .
The following result shows that the parameter can be used to control the probability that Algorithm 3 successfully returns a near-optimal point.
Proposition 6.
If and Algorithm 3 (with parameter ) stops after iterations, we have with probability at least .
Proof.
Let be the iterations of Algorithm 3 at which and . If Algorithm 3 stops, it does so after iterations. Since , we have that at each . By the definition of PLMO, for all . Therefore, is monotonically non-increasing. As such
Suppose that there is some such that condition C.2 (-suboptimality) holds at iterate . Then
where (i) follows from Lemma 1. Combining these inequalities gives that .
Finally, note that the probability that condition C.2 (-suboptimality) is satisfied for at least one out of iterates is at least , so it follows that with probability at least . ∎
5.1 Convergence Analysis of Algorithm 3
The convergence analysis of Algorithm 3, like that of Algorithm 2, also makes use of the partition of the iterates into the -, -, and -subsequences, as in Definition 4. If the first iterates of Algorithm 3 are divided into the -, -, and -subsequences, let , , and (defined in (3.21)) respectively denote the length of each subsequence.
Our analysis of Algorithm 3 follows a broadly similar approach to the analysis of Algorithm 2. The key differences are that we need to bound the time it takes to achieve the stopping criterion for the th time, and that the inequality that allows us to control the approximate duality gap along the -subsequence only holds with probability at least , independently at each iteration.
The main technical tool needed to deal with these issues is a standard tail bound on the number of successes in a sequence of independent (but not necessarily identically distributed) Bernoulli trials.
Lemma 11.
Let be a positive integer, let and let . Let be a positive integer such that . Let be a collection of independent random variables such that takes value with probability and takes value with probability , and assume that for all . Then the probability that at least of the variables takes value is at least .
Proof.
The number of variables that take value is , which is a sum of independent random variables, each taking values in . The mean is . By Hoeffding’s inequality
(5.1) |
Let . From our assumption that we know that and that . It follows that and that . Therefore
(5.2) |
∎
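The bound in Lemma 11 can be sanity-checked numerically: the exact tail of a Poisson-binomial distribution (computed by dynamic programming) should dominate the Hoeffding-style lower bound. The instance below is purely illustrative.

```python
import math

def poisson_binomial_tail(ps, m):
    """Exact P(sum of independent Bernoulli(p_i) >= m) via DP."""
    dist = [1.0]  # dist[k] = P(exactly k successes so far)
    for p in ps:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1.0 - p)
            new[k + 1] += q * p
        dist = new
    return sum(dist[m:])

# Hoeffding gives P(S <= m-1) <= exp(-2 (mu - m + 1)^2 / n) when
# mu >= m - 1, i.e. a lower bound on P(S >= m).
ps = [0.8, 0.9, 0.75, 0.85, 0.8, 0.9, 0.95, 0.8]  # all p_i >= 0.75
n, m = len(ps), 5
mu = sum(ps)
assert mu - (m - 1) >= 0
hoeffding_lower = 1.0 - math.exp(-2.0 * (mu - m + 1) ** 2 / n)
exact = poisson_binomial_tail(ps, m)
assert exact >= hoeffding_lower  # exact tail dominates the bound
```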
We are now ready to provide a bound on the number of iterations performed by Algorithm 3 until the stopping criterion (defined in Step 7 of the algorithm) is satisfied. Lemmas 12 and 13 provide a bound on the number of iterations performed by Algorithm 3 when is set according to the scheduled strategy S.1 and the adaptive strategy S.2, respectively.
Lemma 12.
Proof.
For brevity of notation, define . Let and let the first iterations of Algorithm 3 be partitioned into the -, -, and -subsequences according to Definition 4 with . Then .
From Lemma 10, we know that , and so it follows that . We now consider two cases: (i) , and (ii) . In the first case, the -subsequence will be long enough to establish that the algorithm reaches the stopping criterion. In the second case, the -subsequence will be long enough to ensure that the algorithm reaches the stopping criterion with probability at least .
Case 1: . For , let
Since , it follows that , for all , as required. We now establish bounds on , for , showing that at each of these iterations, the stopping criterion is satisfied.
• Since , it follows that , and so that (by the definition of ).
• By the definition of , from Lemma 10 (with and ) we have that and so
It follows that Algorithm 3 terminates at iteration .
Case 2: . Since , we have that . We will show that among the iterates , the conditions and hold at least times with probability at least . This ensures that Algorithm 3 stops with probability at least by iterate .
Since , it follows (from the definition of ) that for . The subsequence for , has length at least . From Lemma 10, we know that holds with probability at least , independently for each . From Lemma 11, we can conclude that, with probability at least , we have that holds at least times over the iterates .
∎
We now consider the case when is set via the adaptive strategy S.2. Recall that in the deterministic case, the main idea was that the approximate duality gap decreases geometrically along the -subsequence. In the random setting, this does not quite hold, because the bound fails to hold with probability at most . However, if the -subsequence is long enough, then this relationship does hold along a sufficiently long subsequence of the -subsequence, so that the stopping criterion is reached (with any desired probability).
Lemma 13.
Proof.
Let where . Let the first iterations of Algorithm 3 be partitioned into the -, -, and -subsequences according to Definition 4 with . Then . We know that (from Lemma 10). It follows that . We now consider two cases: (i) , (ii) .
Case 1: . If then . For , let
Since , it follows that , for all , as required. We now establish bounds on and , for , showing that at each of these iterations, the stopping criterion is satisfied.
•
•
Case 2: . It is convenient to define a further subsequence of the -subsequence. Let the -subsequence consist of those for which for some and . The -subsequence consists of those indices from the -subsequence at which PLMO satisfies condition C.2 (-suboptimality). From Lemma 10, we know that the -subsequence can be obtained from the -subsequence by (independently) keeping each element of the -subsequence with some probability that is at least . Let denote the (random) number of elements of that are in the -subsequence. From Lemma 11, we know that if then with probability at least .
By the same argument as in Case 2 of Lemma 6 (but replacing with ) we have that
for and that
for . Then, if , we have that and that . In particular, the stopping criterion of Algorithm 3 holds at iterates . As long as , the stopping criterion of Algorithm 3 is achieved times. We have seen that, with probability at least , we have that and so , as required. ∎
6 Application of Algorithm 3 to (SC-GMean)
In this section, we consider the application of Algorithm 3 to generate a near-optimal solution to (SC-GMean). The aim of this section is to work through this example, which covers the motivating example introduced in Section 1.1. We show how to implement PLMO using the Lanczos method in Section 6.1, then state the computational complexity of Algorithm 3 in Section 6.2, where we explicitly define all the parameters of the problem. Finally, in Section 6.3, we show that Algorithm 3 can be made memory-efficient (requiring memory) using the memory-efficient techniques proposed by Shinde et al. [25]. We first restate this problem as the composite optimization problem
(SC-GMean-Cmp) |
where is the indicator function of and is a -logarithmically-homogeneous, 2-self-concordant function defined on . In the case of (SC-GMean-Cmp), the linear subproblem (Lin-OPT) takes the form
(6.1) |
which is equivalent to solving an eigenvalue problem. Indeed, if is a unit-norm eigenvector corresponding to the smallest eigenvalue of , then is an optimal solution to the linear subproblem (6.1). Now, to generate the update direction in Step 6 of Algorithm 3, we need to find an approximate solution to the linear subproblem, or equivalently an approximate solution to the corresponding extreme eigenvalue problem. In this paper, we consider using the Lanczos method (with random initialization) to generate such an approximate solution. In the next subsection, we review an existing convergence result for the Lanczos method and show how it can ultimately be used to implement PLMO. Furthermore, in Lemma 15, we provide a bound on the number of iterations of the Lanczos method needed to implement PLMO. Finally, in Subsection 6.2, we provide the ‘inner’ iteration complexity, i.e., the computational complexity of each iteration of Algorithm 3 applied to (SC-GMean-Cmp), as well as a bound on the total computational complexity of the algorithm.
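As a sketch of this reduction, the snippet below approximates a unit minimum eigenvector v of a small symmetric matrix by running the power method on cI − A, a simpler stand-in for the Lanczos method discussed in the next subsection; the rank-one matrix vvᵀ is then a near-optimal point for a subproblem of the form (6.1). The shift c and the iteration count are illustrative choices, not the paper's.

```python
import math, random

def min_eigvec_power(A, iters=500, seed=0):
    """Approximate a unit eigenvector for the smallest eigenvalue of a
    symmetric matrix A via the power method applied to c*I - A, where
    c >= lambda_max(A) (here bounded by the infinity norm of A)."""
    n = len(A)
    c = max(sum(abs(x) for x in row) for row in A)  # >= lambda_max(A)
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        w = [c * v[i] - sum(A[i][j] * v[j] for j in range(n))
             for i in range(n)]
        nrm = math.sqrt(sum(x * x for x in w))
        v = [x / nrm for x in w]
    lam = sum(v[i] * A[i][j] * v[j] for i in range(n) for j in range(n))
    return v, lam  # lam approximates lambda_min(A)

# diag(1, 2, 5): smallest eigenvalue is 1, eigenvector e_1.
A = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 5.0]]
v, lam = min_eigvec_power(A)
assert abs(lam - 1.0) < 1e-6
# The rank-one matrix v v^T is then a (near-)optimal point for (6.1).
```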
6.1 Implementing PLMO in Algorithm 3 when applied to (SC-GMean-Cmp)
The Lanczos method (see, e.g., [14]) is an iterative method that can be used to determine the extreme eigenvalues and eigenvectors of a Hermitian matrix. A typical convergence result for the method, when initialized with a uniform random point on the unit sphere, is given in Lemma 14.
Lemma 14 (Convergence of Lanczos method [14]).
Let and . For , let denote the largest absolute eigenvalue of . The Lanczos method, after at most iterations, generates a unit vector that satisfies
(6.2) |
with probability at least . The computational complexity of the method is , where is the matrix-vector multiplication complexity for the matrix .
Algorithm 4 shows how to use the Lanczos method as the basis for an implementation of PLMO, to generate the update direction in Algorithm 3. The following result establishes that Algorithm 4 is indeed a valid implementation of PLMO.
Input: , , Output: (, )
Lemma 15.
Algorithm 4 implements PLMO correctly for the problem (SC-GMean-Cmp).
Proof.
For , we can rewrite (6.2) as . Let (set according to Lemma 9). From Lemma 14, we see that after iterations, the Lanczos method generates a unit vector that satisfies, with probability at least ,
(6.3) |
where the last equality follows from the definition of .
Furthermore, since , and , we can rewrite (6.3) as
(6.4) |
where the last equality follows since . So, from Lemma 9, we see that if (6.4) holds, then at least one of the two conditions C.1 (large gap) and C.2 (-suboptimality) is satisfied. Since (6.4) holds with probability at least , we have that after at most iterations of the Lanczos method, the generated pair satisfies at least one of the two conditions C.1 (large gap) and C.2 (-suboptimality) with probability at least . ∎
We are now ready to provide the convergence analysis and the total computational complexity of Algorithm 3 when applied to (SC-GMean-Cmp) in the next subsection.
6.2 Computational complexity of Algorithm 3 for (SC-GMean)
In this section, we provide the computational complexity of Algorithm 3 when applied to (SC-GMean). The total computational complexity of Algorithm 3 is the ‘inner’ iteration complexity, i.e., the computational complexity of each iteration of Algorithm 3, multiplied by the number of iterations performed by the algorithm. The two most computationally expensive steps in Algorithm 3 are (i) Step 10 (computing ), and (ii) Step 6 (implementing PLMO), whose computational complexity can be determined using Lemmas 14 and 15. We elaborate on this complexity result in the proof of Lemma 16.
Lemmas 12 and 13 (from Section 5) provide a bound on the number of iterations performed by Algorithm 3 when the scheduled strategy S.1 and the adaptive strategy S.2 are used, respectively. Note that the bounds given in these lemmas depend on the parameters , and . We first compute a bound on these quantities and finally provide the total computational complexity of Algorithm 3 in Lemma 16.
Values of parameters , and .
Since in (SC-GMean-Cmp) is a -logarithmically-homogeneous function, we have . Also, since is an indicator function for the set and Algorithm 3 generates feasible updates to the decision variable, we have .
Finally, we determine an upper bound on . The initial optimality gap, is defined as
(6.5) |
where the last equality holds since and . Now let , so that . To generate a lower bound on , we use the knowledge of the dual function. The dual of (SC-GMean) is
(6.6) |
Let and , for . Note that, , is a feasible solution to the dual problem (6.6). Therefore,
(6.7) |
The optimality gap can now be bounded by
(6.8) |
Lemma 16.
Let (SC-GMean-Cmp) be solved using Algorithm 3 with parameters , and , such that PLMO fails with probability at most .
1.
2. Inner iteration complexity when scheduled strategy S.1 is used. When scheduled strategy S.1 with and for all is used in Algorithm 3, and the Lanczos method is used to implement PLMO in each iteration of the algorithm, the total computational complexity of each iteration of Algorithm 3 is bounded by , where denotes the complexity of matrix-vector multiplication with .
3.
Proof.
Outer iteration complexity. When scheduled strategy S.1 with is used, we have that for all . Substituting , and the values of parameters , and in Lemma 12, we see that, with probability at least , Algorithm 3 stops after at most
iterations. Thus, we have that is bounded by .
Inner iteration complexity when scheduled strategy S.1 is used. The two most expensive steps in Algorithm 3 are computing , and implementing PLMO. PLMO is implemented using the Lanczos method, whose computational complexity is given in Lemma 14. Assume that scheduled strategy S.1 with , and is used for all . From Lemma 14, we see that the computational complexity of the Lanczos method depends on . Note that . If , then for all . Moreover, we have that and for all . Thus, , where . And so, if we store for all , then we see that the complexity of computing for all is bounded by , i.e., the complexity is bounded by the sum of the complexities of matrix-vector multiplication with each . Finally, from Lemmas 14 and 15 (with ), we have that the computational complexity of PLMO at iteration of Algorithm 3 is bounded by .
Furthermore, recall that , and so, the complexity of computing is also bounded by . Thus, the total computational complexity of each iteration of Algorithm 3 is bounded by .
6.3 Making Algorithm 3 memory-efficient
One reason to use methods based on the Frank-Wolfe algorithm for semidefinite optimization problems with constraint is that they can typically be modified to avoid explicit storage of a PSD decision variable, and hence can be made memory-efficient. One approach is to represent the PSD decision variable by a Gaussian random vector , as suggested by Shinde et al. [25]. This is particularly useful in situations where the semidefinite program is used as the input to a rounding scheme that only requires such Gaussian samples. An example of this situation is discussed in Section 1.2.1, where an approximation algorithm for maximizing the product of quadratic forms on the unit sphere is obtained by solving (SC-GMean-Cmp) to get a matrix , and then sampling a Gaussian vector with covariance equal to and normalizing it, to get a feasible point for the original optimization problem on the sphere.
The Gaussian sampling-based approach takes advantage of two properties of a problem with the form
The first is the fact that the objective function depends on only via the lower-dimensional vector with coordinates for . The second key property is the fact that the linearized subproblem is an extreme eigenvalue problem. Therefore it always has a rank-one optimal point, and we can implement PLMO so that the output is always rank one. The Gaussian sampling-based approach then involves making the following modifications to the classical Frank-Wolfe algorithm [25, Algorithm 2]:
• Initializing the samples with , where is a covariance matrix for which the samples can be generated without explicitly storing , such as .
• Modifying PLMO so that it returns the vector rather than the rank-one matrix .
• Updating the samples via where the are a sequence of i.i.d. random variables.
• Updating via for .
• Expressing the linearized objective function in terms of via , where is the partial derivative of with respect to the th coordinate.
The updates for the modified algorithm depend on the parameters of the linear map only via matrix-vector products with the . To run the algorithm, only (a vector of length ) and (a vector of length ) need to be stored. If PLMO is implemented via an algorithm that only requires matrix-vector multiplications with (such as the power method), then this only requires and access to the via matrix-vector products, and only requires storage (in addition to the storage associated with the ).
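The second-moment bookkeeping behind the sample update can be checked exactly. If z⁺ = √(1−η)·z + √η·ζ·v with ζ zero-mean and unit-variance, then E[z⁺(z⁺)ᵀ] = (1−η)·zzᵀ + η·vvᵀ, mirroring the Frank-Wolfe update on the matrix variable. The snippet below verifies this identity by exact enumeration using a Rademacher ζ ∈ {−1, +1}; this choice of ζ is an illustrative assumption (any zero-mean, unit-variance distribution gives the same identity in expectation).

```python
import math

def outer(u, w):
    return [[a * b for b in w] for a in u]

def lincomb(alpha, A, beta, B):
    n = len(A)
    return [[alpha * A[i][j] + beta * B[i][j] for j in range(n)]
            for i in range(n)]

# Average z+ z+^T over zeta = +1 and zeta = -1: the cross terms cancel,
# leaving exactly (1 - eta) z z^T + eta v v^T.
eta = 0.25
z = [1.0, -2.0, 0.5]
v = [0.0, 1.0, 1.0]
moments = []
for zeta in (+1.0, -1.0):
    zp = [math.sqrt(1 - eta) * zi + math.sqrt(eta) * zeta * vi
          for zi, vi in zip(z, v)]
    moments.append(outer(zp, zp))
avg = lincomb(0.5, moments[0], 0.5, moments[1])
target = lincomb(1 - eta, outer(z, z), eta, outer(v, v))
assert all(abs(avg[i][j] - target[i][j]) < 1e-12
           for i in range(3) for j in range(3))
```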
When extending this approach to the setting of generalized Frank-Wolfe, the main additional issue is computing and (and consequently ). These computations can be done in a memory-efficient way because
where is the vector with coordinates for . Again, computing and only requires the vector as well as access to the via matrix-vector multiplications to compute .
In the specific case of (SC-GMean-Cmp), we have that is separable and so
Proposition 7.
Suppose that the power method [14] is used to implement PLMO in Algorithm 3. Then, Algorithm 3, combined with the Gaussian sampling-based representation of the decision variable, uses memory (in addition to the memory required to store the input parameters) to generate a zero-mean Gaussian random vector such that is an -optimal solution to the input problem (SC-GMean-Cmp).
7 Numerical Experiments
For our experiments, we implemented Algorithm 3 with the two strategies S.1 (scheduled) and S.2 (adaptive) for , as well as [36, Algorithm 1]. We generated two types of random instances of (SC-GMean), and report the output of the three algorithms described below. The aim of our experiments was to assess the practical performance of Algorithm 3, and to investigate how conservative the theoretical bounds are relative to practical observations.
Generating random instances of (SC-GMean-Cmp).
We generated two types of problem instances of (SC-GMean-Cmp), namely Diag and Rnd.
• Diag problem instances. In Diag problem instances, were generated as follows: for , with other entries in the matrix set to 0. For Diag instances, the optimal solution has a scaled identity matrix on its principal diagonal and 0 everywhere else. Accordingly, we computed the optimality gap .
• Rnd problem instances. Rnd problem instances were generated by sampling parameter matrices ’s for such that where for each . For these instances, we use as an upper bound on , as derived in Section 6.2.
Experimental Setup.
For our experiments, we implemented the following three algorithms: (a) Algorithm 3 with , i.e., for all (GFW-ApproxI), (b) Algorithm 2 with (GFW-ApproxII), and (c) [36, Algorithm 1] (GFW-Exact). Note that, for (SC-GMean-Cmp), , and since is an indicator function for and all three algorithms generate iterates feasible for the set . The values of other parameters were set as: , . Note that, Algorithm 3, and thus, GFW-ApproxI and GFW-ApproxII, stop after the bounds, and , are satisfied at least times, whereas GFW-Exact stops if [36]. Let be defined as
(7.1) |
which follow from Lemma 13, Lemma 16 and [36, Theorem 2.1], respectively. Note that, in (7.1), is an upper bound on the probability of failure of PLMO, and determines its computational complexity (see Lemma 16), while denotes an upper bound on the probability that the algorithm stops after at most iterations. Thus, if is as defined in (7.1), we have that with probability at least , GFW-ApproxI and GFW-ApproxII stop after at most iterations, and with probability 1, GFW-Exact stops after at most iterations. For our experiments, we set .
Using eigs in MATLAB.
For (SC-GMean-Cmp), the update direction generated at each iteration of Algorithm 3 is a rank-1 matrix such that is an approximate minimum eigenvector of . The command eigs in MATLAB generates an approximate minimum eigenvalue-eigenvector pair for the matrix such that is a unit vector and , where tol is a user-defined value. In the case of GFW-ApproxI, we set , which is the default value, at each iteration of the algorithm. At each iteration of GFW-ApproxII, we set . Note that, for (SC-GMean-Cmp), the optimal objective function value of the subproblem (Lin-OPT) at iteration is . We verify that, by setting the value of tol as given above for both algorithms, eigs returns a unit vector such that at each iteration . Thus, eigs returns an update direction such that for all .
Results for Diag problem instances.
The random instances generated were of two sizes (, ). For Diag problem instances, we randomly initialized all three algorithms to such that for each . For each problem instance, we performed 10 runs of each algorithm with different random initialization. Table 1 shows the computational results for GFW-ApproxI, GFW-ApproxII and GFW-Exact. The quantity denotes the number of iterations until the algorithms stop and , defined in (7.1), denotes an upper bound on . Note that the average is taken over 10 independent runs with different random initializations for each problem instance.
| | | Algorithm | Avg time (secs) | Avg () | Avg () | |
|---|---|---|---|---|---|---|
500 | 50 | GFW-ApproxI | 652.33 | 4.8 | 2.948 | 356.79 |
500 | 50 | GFW-ApproxII | 682.33 | 7.2 | 3.03 | 337.07 |
500 | 50 | GFW-Exact | 1038.1 | 1.21 | 5.054 | 0.95 |
500 | 100 | GFW-ApproxI | 2374.9 | 19.2 | 5.717 | 398.36 |
500 | 100 | GFW-ApproxII | 2548 | 28.8 | 5.965 | 748.17 |
500 | 100 | GFW-Exact | 8407.4 | 4.81 | 20.12 | 2.02 |
From the average and average in Table 1, we see that all three algorithms perform approximately two orders of magnitude better than the theoretical convergence bounds. Also, the standard deviation of shows small variation over the random initializations of the algorithms. We also see that, for both random instances of varying size, GFW-ApproxI and GFW-ApproxII perform fewer iterations than GFW-Exact before stopping. Figure 1 shows the change in the optimality gap with the number of iterations for the three algorithms when they were used to generate an -optimal solution to a Diag instance of (SC-GMean-Cmp) with , . From the figure, we see that all three algorithms show a similar trend in the change of the optimality gap at each iteration, and after a few initial iterations, the algorithms converge linearly to an -optimal solution. We also observe that GFW-ApproxI and GFW-ApproxII require slightly fewer iterations than GFW-Exact to converge to an -optimal solution.

Results for Rnd problem instances.
The three algorithms were randomly initialized to such that with for . For each problem instance, we performed 10 runs of each algorithm with different random initialization. The results for Rnd problem instances are given in Table 2. In the table, the quantity denotes the number of iterations until the algorithms stop and , defined in (7.1), denotes an upper bound on . The average values in the table arise from the different random initial points during each run of the algorithm.
Algorithm | Avg time (secs) | Avg () | Avg () | |||
---|---|---|---|---|---|---|
200 | 250 | GFW-ApproxI | 1.84 | 1.20 | 88.7 | 1.34 |
200 | 250 | GFW-ApproxII | 2.46 | 1.80 | 89.7 | 1.34 |
200 | 250 | GFW-Exact | 1.94 | 0.30 | 88.7 | 1.34 |
300 | 350 | GFW-ApproxI | 6.93 | 2.35 | 111.7 | 1.83 |
300 | 350 | GFW-ApproxII | 6.89 | 3.52 | 112.7 | 1.83 |
300 | 350 | GFW-Exact | 6.23 | 0.58 | 111.7 | 1.83 |
400 | 450 | GFW-ApproxI | 14.54 | 3.88 | 131.1 | 1.37 |
400 | 450 | GFW-ApproxII | 16.96 | 5.82 | 132.1 | 1.37 |
400 | 450 | GFW-Exact | 16.38 | 0.97 | 131.1 | 1.37 |
400 | 600 | GFW-ApproxI | 21.66 | 6.91 | 149.7 | 1.49 |
400 | 600 | GFW-ApproxII | 24.92 | 10.36 | 150.7 | 1.49 |
400 | 600 | GFW-Exact | 23.79 | 1.73 | 149.7 | 1.49 |
500 | 750 | GFW-ApproxI | 63.7 | 10.8 | 172.4 | 1.5 |
500 | 750 | GFW-ApproxII | 52.72 | 16.2 | 173.4 | 1.62 |
500 | 750 | GFW-Exact | 71.71 | 2.70 | 172.4 | 1.5 |
600 | 900 | GFW-ApproxI | 152 | 15.55 | 193.6 | 1.17 |
600 | 900 | GFW-ApproxII | 142.74 | 23.32 | 194.7 | 1.25 |
600 | 900 | GFW-Exact | 122.43 | 3.89 | 193.6 | 1.17 |
700 | 750 | GFW-ApproxI | 170.08 | 10.8 | 181.8 | 1.32 |
700 | 750 | GFW-ApproxII | 167.85 | 16.2 | 182.8 | 1.32 |
700 | 750 | GFW-Exact | 176.51 | 2.70 | 181.8 | 1.32 |
From the values of average and average in Table 2, we see that, for each problem instance, all three algorithms satisfy the stopping criterion after iterations. This value is several orders of magnitude smaller than the theoretical upper bounds given in (7.1). Furthermore, the small standard deviation of indicates very little variation across random initializations of the algorithms. Finally, we also observed that, for varying values of and , all algorithms take the same number of iterations to converge to an -optimal solution. Thus, for these randomly generated instances, the approximation error in solving the linear subproblem does not affect the convergence of the algorithm. Furthermore, Figure 2 shows the change in with the number of iterations for the three algorithms (with a Rnd instance of size , as input) until the stopping criterion is satisfied. The y-axis shows the average of taken over 10 runs of the algorithms, each with a different random initialization. Since , the y-axis gives an upper bound on the optimality gap. Note that, for GFW-ApproxII, we have , and so, for GFW-ApproxII, takes a larger value than for GFW-ApproxI and GFW-Exact. Furthermore, all three algorithms show a similar trend in the decrease of , and consequently, of . We also observe that decreases exponentially with respect to . Note that this is much faster convergence than for the Diag instances (see Figure 1).
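The per-iteration upper bound on the optimality gap reported above is in the spirit of the standard Frank-Wolfe duality-gap certificate: by convexity, the gap of the linear subproblem at the current iterate upper-bounds the optimality gap. A minimal sketch (the simplex LMO and the quadratic objective in the usage note are hypothetical stand-ins, not the bound (7.1) itself):

```python
import numpy as np

def fw_gap(grad, x, lmo):
    """Frank-Wolfe gap g = <grad, x - v>, where v minimizes the linear
    subproblem over the feasible set. By convexity of f,
    f(x) - f(x*) <= g, so g is a computable optimality-gap certificate."""
    v = lmo(grad)
    return float(grad @ (x - v))

def lmo_simplex(grad):
    """Exact LMO over the probability simplex: the minimizing vertex."""
    v = np.zeros_like(grad)
    v[np.argmin(grad)] = 1.0
    return v
```

For instance, with the objective f(x) = 0.5||x - c||^2 and c in the simplex, `fw_gap` evaluated at any feasible x is nonnegative and dominates f(x) - f(c).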
From these experiments (on Diag as well as Rnd instances), we see that our proposed method (Algorithm 3) performs similarly to Algorithm 1, i.e., the variant with an exact linear minimization oracle.
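To make the inexact-oracle setting concrete, the sketch below runs Frank-Wolfe over the unit-trace PSD slice with a linear minimization oracle approximated by power iteration from a random start, echoing how our subproblems reduce to extreme eigenvalue-eigenvector computations. The quadratic objective and all function names are hypothetical stand-ins, not the (SC-OPT) objective used in the experiments:

```python
import numpy as np

def approx_lmo(G, iters=200, rng=None):
    """Approximate linear minimization over {V : V >= 0, tr(V) = 1}.
    The exact minimizer of <G, V> is v v^T, where v is a unit eigenvector
    for the smallest eigenvalue of G; here v is approximated by power
    iteration with a random start, mimicking an inexact randomized oracle."""
    rng = np.random.default_rng() if rng is None else rng
    n = G.shape[0]
    # Shift so the smallest eigenvalue of G becomes the largest of M.
    M = (np.linalg.norm(G) + 1.0) * np.eye(n) - G
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return np.outer(v, v)

def inexact_fw(B, steps=300, rng=None):
    """Frank-Wolfe with the approximate oracle above, applied to the toy
    objective f(X) = 0.5 * ||X - B||_F^2 over the unit-trace PSD slice."""
    rng = np.random.default_rng() if rng is None else rng
    n = B.shape[0]
    X = np.eye(n) / n                  # feasible starting point
    for t in range(steps):
        G = X - B                      # gradient of f at X
        V = approx_lmo(G, rng=rng)
        gamma = 2.0 / (t + 2.0)        # standard open-loop step size
        X = (1.0 - gamma) * X + gamma * V
    return X
```

With B itself feasible, the iterates approach B even though the oracle returns only an approximate extreme eigenvector, which is the qualitative behavior we observe for GFW-ApproxI/II relative to GFW-Exact.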

8 Discussion
In this paper, we proposed a first-order algorithm, based on the Frank-Wolfe method, with provable approximation guarantees for composite optimization problems with a -logarithmically-homogeneous, -self-concordant barrier function in the objective and a convex, compact feasible set. We showed that it suffices to solve the linear subproblem at each iteration only approximately, which makes Algorithm 2 more practical and reduces the complexity of each iteration. We also showed that Algorithm 2 generates an -optimal solution to (SC-OPT) when the oracle returns an approximate solution with either additive or relative approximation error. We analyzed Algorithm 2 for two different settings of the value of ; alternative techniques for setting this value could potentially yield faster convergence. Since the convergence of the algorithm depends on the initial optimality gap, warm-starting techniques could potentially be used to speed up the method. We further proposed Algorithm 3, which generates an -optimal solution to (SC-OPT) with high probability when the linear minimization oracle has a nonzero probability of failure. Finally, we performed numerical experiments whose results confirmed that the convergence bounds are satisfied on random instances of (SC-GMean).
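The high-probability guarantee behind Algorithm 3 rests on a standard amplification argument: if a single call to the randomized oracle fails with probability at most p, then k independent calls all fail with probability at most p^k. A minimal sketch of the resulting repetition count (the function name is illustrative):

```python
import math

def repetitions_needed(p_fail, delta):
    """Smallest k with p_fail**k <= delta: the number of independent
    repetitions of an oracle that fails with probability at most p_fail
    needed to drive the overall failure probability below delta."""
    return math.ceil(math.log(delta) / math.log(p_fail))
```

For example, an oracle that fails half the time needs only 20 repetitions to reach failure probability 1e-6, so the overhead of such a restarting scheme grows only logarithmically in the target confidence.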
Acknowledgements.
This material is based upon work supported by the National Science Foundation under Grant No. DMS-1929284 while the author, Nimita Shinde, was in residence at the Institute for Computational and Experimental Research in Mathematics in Providence, RI, during the Discrete Optimization: Mathematics, Algorithms, and Computation semester program.
Appendix A Preliminary results
Lemma 17.
Proof.
Zhao and Freund [36] state the relationship between and the duality gap , given in [36, Proposition 2.3]. The proof of (A.1) is adapted from the proof of [36, Proposition 2.3].
From the definition of , we have
(A.2) |
We now bound as
(A.3) |
where (i) follows from (2.5), (ii) follows from the definition of (given in (2.10)), and (iii) follows from (2.7). For simplicity, define . Then, from (2.6), (2.7) and (2.10), it follows that
(A.4) |
where (i) follows from (2.6), (ii) follows from (2.10), and (iii) follows from (2.7). Combining (A.2), (A.3) and (A.4),
where (i) uses the fact that and , and (ii) uses the definition of in (2.12). ∎
References
- Atwood [1969] Corwin L Atwood. Optimal and efficient designs of experiments. The Annals of Mathematical Statistics, pages 1570–1602, 1969.
- Bach [2010] Francis Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
- Baumgratz et al. [2013] Tillmann Baumgratz, Alexander Nüßeler, Marcus Cramer, and Martin B Plenio. A scalable maximum likelihood method for quantum state tomography. New Journal of Physics, 15(12):125004, 2013.
- Cover [1984] Thomas Cover. An algorithm for maximizing expected log investment return. IEEE Transactions on Information Theory, 30(2):369–373, 1984.
- de Aguiar et al. [1995] P Fernandes de Aguiar, B Bourguignon, MS Khots, DL Massart, and R Phan-Than-Luu. D-optimal designs. Chemometrics and Intelligent Laboratory Systems, 30(2):199–210, 1995.
- Dvurechensky et al. [2020] Pavel Dvurechensky, Petr Ostroukhov, Kamil Safin, Shimrit Shtern, and Mathias Staudigl. Self-concordant analysis of Frank-Wolfe algorithms. In International Conference on Machine Learning, pages 2814–2824. PMLR, 2020.
- Dvurechensky et al. [2022] Pavel Dvurechensky, Kamil Safin, Shimrit Shtern, and Mathias Staudigl. Generalized self-concordant analysis of Frank–Wolfe algorithms. Mathematical Programming, pages 1–69, 2022.
- Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics, 3(1-2):95–110, 1956.
- Glancy et al. [2012] Scott Glancy, Emanuel Knill, and Mark Girard. Gradient-based stopping rules for maximum-likelihood quantum-state tomography. New Journal of Physics, 14(9):095017, 2012.
- Harmany et al. [2011] Zachary T Harmany, Roummel F Marcia, and Rebecca M Willett. This is SPIRAL-TAP: Sparse Poisson intensity reconstruction algorithms—theory and practice. IEEE Transactions on Image Processing, 21(3):1084–1096, 2011.
- Hradil [1997] Zdeněk Hradil. Quantum-state estimation. Physical Review A, 55(3):R1561, 1997.
- Hradil et al. [2004] Zdeněk Hradil, Jaroslav Řeháček, Jaromír Fiurášek, and Miroslav Ježek. Maximum-likelihood methods in quantum mechanics. In Quantum state estimation, pages 59–112. Springer, 2004.
- Jaggi [2013] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427–435, 2013.
- Kuczyński and Woźniakowski [1992] Jacek Kuczyński and Henryk Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992.
- Liu et al. [2021] Deyi Liu, Volkan Cevher, and Quoc Tran-Dinh. A Newton Frank–Wolfe method for constrained self-concordant minimization. Journal of Global Optimization, pages 1–27, 2021.
- Marteau-Ferey et al. [2019] Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Beyond least-squares: Fast rates for regularized empirical risk minimization through self-concordance. In Conference on Learning Theory, pages 2294–2340. PMLR, 2019.
- Nemirovski and Todd [2008] Arkadi S Nemirovski and Michael J Todd. Interior-point methods for optimization. Acta Numerica, 17:191–234, 2008.
- Nesterov [2003] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
- Nesterov and Nemirovskii [1994] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
- Nowak and Kolaczyk [1998] Robert Nowak and Eric D Kolaczyk. A multiscale MAP estimation method for Poisson inverse problems. In Conference Record of Thirty-Second Asilomar Conference on Signals, Systems and Computers (Cat. No. 98CH36284), volume 2, pages 1682–1686. IEEE, 1998.
- Nowak and Kolaczyk [2000] Robert Nowak and Eric D Kolaczyk. A multiscale statistical framework for Poisson inverse problems. IEEE Transactions on Information Theory, 46:1811–1825, 2000.
- Owen [2013] Art B Owen. Self-concordance for empirical likelihood. Canadian Journal of Statistics, 41(3):387–397, 2013.
- Rue and Held [2005] Havard Rue and Leonhard Held. Gaussian Markov random fields: theory and applications. Chapman and Hall/CRC, 2005.
- Scholten and Blume-Kohout [2018] Travis L Scholten and Robin Blume-Kohout. Behavior of the maximum likelihood in quantum state tomography. New Journal of Physics, 20(2):023050, 2018.
- Shinde et al. [2021] Nimita Shinde, Vishnu Narayanan, and James Saunderson. Memory-efficient structured convex optimization via extreme point sampling. SIAM Journal on Mathematics of Data Science, 3(3):787–814, 2021.
- Tran-Dinh et al. [2013] Quoc Tran-Dinh, Anastasios Kyrillidis, and Volkan Cevher. A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions. In International Conference on Machine Learning, pages 271–279. PMLR, 2013.
- Tran-Dinh et al. [2014] Quoc Tran-Dinh, Anastasios Kyrillidis, and Volkan Cevher. An inexact proximal path-following algorithm for constrained convex minimization. SIAM Journal on Optimization, 24(4):1718–1745, 2014.
- Tran-Dinh et al. [2015] Quoc Tran-Dinh, Anastasios Kyrillidis, and Volkan Cevher. Composite self-concordant minimization. Journal of Machine Learning Research, 16(1):371–416, 2015.
- Tropp et al. [2017] Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications, 38(4):1454–1485, 2017.
- Tropp et al. [2019] Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Streaming low-rank matrix approximation with an application to scientific simulation. SIAM Journal on Scientific Computing, 41(4):A2430–A2463, 2019.
- Vardi and Lee [1993] Y Vardi and D Lee. From image deblurring to optimal investments: Maximum likelihood solutions for positive linear inverse problems. Journal of the Royal Statistical Society: Series B (Methodological), 55(3):569–598, 1993.
- Yuan and Parrilo [2021] Chenyang Yuan and Pablo A Parrilo. Semidefinite relaxations of products of nonnegative forms on the sphere. arXiv preprint arXiv:2102.13220, 2021.
- Yurtsever et al. [2017] Alp Yurtsever, Madeleine Udell, Joel A Tropp, and Volkan Cevher. Sketchy decisions: Convex low-rank matrix optimization with optimal storage. In Artificial Intelligence and Statistics, pages 1188–1196, 2017.
- Yurtsever et al. [2021] Alp Yurtsever, Joel A Tropp, Olivier Fercoq, Madeleine Udell, and Volkan Cevher. Scalable semidefinite programming. SIAM Journal on Mathematics of Data Science, 3(1):171–200, 2021.
- Zhao [2021] Renbo Zhao. Non-asymptotic convergence analysis of the multiplicative gradient algorithm for the log-optimal investment problems. arXiv preprint arXiv:2109.05601, 2021.
- Zhao and Freund [2022] Renbo Zhao and Robert M Freund. Analysis of the Frank–Wolfe method for convex composite optimization involving a logarithmically-homogeneous barrier. Mathematical Programming, pages 1–41, 2022.