[email protected]
Fan Chen, School of Mathematical Science, Peking University, Beijing, China. Email: [email protected]
Junyu Zhang (Corresponding Author), Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore. Email: [email protected]
Zaiwen Wen, International Center for Mathematical Research, Center for Machine Learning Research and College of Engineering, Peking University, Beijing, China. Email: [email protected]
On the Optimal Lower and Upper Complexity Bounds for a Class of Composite Optimization Problems
Abstract
We study the optimal lower and upper complexity bounds for finding approximate solutions to the composite problem $\min_x f(x) + h(Ax)$, where $f$ is smooth and $h$ is convex. Given access to the proximal operator of $h$, we design efficient first-order algorithms for the strongly convex, convex, and nonconvex cases of $f$. The resulting bounds depend on the condition number $\kappa$ of the matrix $A$ in the composition, the smoothness constant $L$ of $f$, the condition number of $f$ in the strongly convex case, the initial point distance, and the initial function value gap. Tight lower complexity bounds for the three cases are also derived, and they match the upper bounds up to logarithmic factors, thereby demonstrating the optimality of both the upper and lower bounds proposed in this paper.
Keywords:
composite optimization · first-order method · upper bound · lower bound

MSC: 90C25 · 90C26 · 90C46 · 90C60

1 Introduction
In this paper, we consider the composite optimization problem:
$\min_{x \in \mathbb{R}^n} \; f(x) + h(Ax), \qquad (1)$
where $f$ is a differentiable function whose gradient is $L$-Lipschitz continuous, $h$ is a proper closed convex function whose proximal operator can be efficiently evaluated, and $A$ is a matrix with full row rank whose minimal singular value is positive. We denote the condition number of the matrix $A$ by $\kappa$. Problem (1) appears in many practical applications, such as distributed optimization shi2015extra ; chang2016asynchronous ; hong2017prox , signal and image processing yang2011alternating ; zhu2008efficient ; chambolle2011first , network optimization feijer2010stability and optimal control bertsekas2012dynamic . In this paper, we aim to establish tight upper and lower complexity bounds for all three cases of strongly convex, convex, and non-convex $f$.
When $h$ is chosen to be the indicator function $\iota_{\{b\}}$, the problem becomes the linear equality constrained problem
$\min_{x} \; f(x) \quad \text{s.t.} \quad Ax = b. \qquad (2)$
More generally, when $h$ is chosen to be $\iota_{b - \mathcal{K}}$ where $\mathcal{K}$ is a proper cone, the problem becomes the conic inequality constrained problem
$\min_{x} \; f(x) \quad \text{s.t.} \quad b - Ax \in \mathcal{K}. \qquad (3)$
Alternatively, when $h$ is a norm, the problem becomes the norm regularized problem
$\min_{x} \; f(x) + \|Ax\|. \qquad (4)$
The motivation of this paper originates from the study of the complexity of non-convex optimization with linear equality constraints (2), which corresponds to the special case of (1) where $h = \iota_{\{b\}}$ and $f$ is non-convex. This problem has received considerable attention in the literature. The authors of kong2019complexity combine the proximal point method and the quadratic penalty method, achieving a sublinear complexity for finding an approximate stationary point. Under the assumption that the domain is bounded, the authors of kong2023iteration develop an algorithm that solves the proximal augmented Lagrangian subproblem by an accelerated composite gradient method, resulting in an improved complexity bound.
Several recent works have further improved the complexity, including hong2017prox ; sun2019distributed ; zhang2020proximal ; zhang2022global . The key factor contributing to this improvement is the incorporation of the minimal nonzero singular value of $A$ in the analysis of the complexity bounds. Specifically, the condition number of $A$ is required to be bounded by some constant. In hong2017prox , a proximal primal-dual algorithm is developed, but its dependence on the problem parameters is very large. In sun2019distributed , a novel optimal method for distributed optimization is proposed, yet its reliance on Chebyshev acceleration prevents its generalization to problems with general $h$. The authors of zhang2020proximal focus on the problem with an extra box constraint, but they require a complementarity property that is not easy to check in advance. The subsequent work zhang2022global relaxes this assumption, but the dependence on other parameters remains unclear.
Therefore, it is natural to ask what the complexity upper and lower bounds of first-order methods are for the non-convex linearly constrained problem (2). To answer this question, we consider the general form (1) and utilize the norm of the proximal gradient mapping as the convergence measure. When $h$ reduces to an indicator function, which corresponds to the linear equality constrained problem (2), this error measure reduces to the norm of the projected gradient mapping. We rigorously define the first-order linear span algorithm class and construct a hard instance to establish a lower bound of order $\kappa \Delta L \epsilon^{-2}$, where $\Delta$ quantifies the gap in the objective function at the initial point. To demonstrate the tightness of the lower bound, we design an efficient algorithm based on the idea of the accelerated proximal point algorithm (APPA), which matches this bound up to logarithmic factors.
The APPA approach solves a non-convex problem by successively solving a sequence of strongly convex subproblems. Hence, the optimality of the complexity for solving problem (1) with strongly convex $f$ is also crucial. One straightforward approach is to introduce an additional variable and apply ADMM to solve the linearly constrained problem where one function in the objective is strongly convex and the other is convex. For example, sublinear convergence rates are obtained in cai2017convergence and lin2015sublinear . The lower bound is examined by constructing a challenging linear equality constrained problem in ouyang2021lower : assuming a strong convexity parameter $\mu$, the derived lower bound involves the norm of the optimal dual variable. Remarkably, its dominant term matches the upper bound provided in xu2021iteration up to logarithmic factors.
If we further incorporate the minimal nonzero singular value of $A$ into the complexity bound, it is possible to “break” the lower bound established in ouyang2021lower . For example, the ADMM and a set of primal-dual methods are proved to have linear convergence rates (lin2015global, ; zhu2022unified, , etc.). Furthermore, for the linear equality constrained problem (2) with strongly convex $f$, the algorithms proposed in zhu2022unified and salim2022optimal achieve linear-rate complexities with different dependence on the condition numbers. However, salim2022optimal utilizes the Chebyshev iteration as a subroutine, and is hence restricted to linear equality constrained problems. A corresponding lower bound is also claimed for the linear equality constrained case in salim2022optimal .
For completeness, we also discuss the case of general convex $f$, which has been extensively investigated. The authors of ouyang2015accelerated and xu2017accelerated design algorithms that converge to an approximate optimal solution with sublinear complexity. zhu2022unified proposes a unified framework that covers several well-known primal-dual algorithms and establishes an ergodic sublinear convergence rate. The lower bound is provided in ouyang2021lower , and its dominant term matches the upper bounds provided in ouyang2015accelerated . Recent research also establishes better complexity results under stronger assumptions. For the linearly constrained problem, xu2022first utilizes the cutting-plane method as the subroutine of ALM and proves a dimension-dependent complexity bound. For a class of bilinear and smooth minimax problems (which include linearly constrained problems), song2020breaking proposes an algorithm with improved complexity, but its dependence on other constants remains suboptimal. Under our setting, we demonstrate that matching lower and upper bounds can be achieved by a reduction from the strongly convex case.
Contribution. Our results are listed in Table 1 and our main contributions are summarized as follows.
•
Under the assumption of bounded $\kappa$, we establish the complexity lower bounds for solving (1) with non-convex, strongly convex, and convex $f$, within the first-order linear span algorithm class, by constructing three hard linear equality constrained instances.
•
By exploiting the idea of the (accelerated) proximal point algorithm, we design efficient algorithms to solve problem (1) in all three cases.
•
Under the assumption of bounded $\kappa$, we prove that the complexities of our algorithms match the lower bounds up to logarithmic factors. Therefore, these complexities are optimal.
•
For the special case of the linear equality constrained problem (2), we prove that the full row rank assumption can be removed. The upper bounds remain unchanged except that the minimal singular value is replaced by the minimal nonzero singular value.
Setting | Optimality Measure | Complexity |
Strongly Convex | ||
Convex | ||
Non-convex | ||
Related works. Several existing works are closely related to this paper. The first one that we would like to discuss is song2020breaking . It is designed for a class of smooth bilinear minimax problems, which includes convex linearly constrained problems after proper reformulation. With some extra computation, the results of song2020breaking indicate a complexity bound involving an additional term that upper bounds the norm of the optimal dual variables. First, this additional term makes their algorithm suboptimal for our problem class. Second, for linear equality constrained problems, their results require the constraint matrix to be full rank, while our result does not require this property. In our paper, we provide a tight lower complexity bound for this problem class and an optimal first-order algorithm that achieves the lower bound.
The second closely related work is salim2022optimal , which derives the complexity of the smooth strongly convex linear equality constrained problem (2). Letting $P$ be some properly selected Chebyshev polynomial, they propose an accelerated algorithm with Chebyshev iteration, based on an equivalence between the original linear constraint and its polynomial-preconditioned counterpart. However, this technique is restricted to linear equality constrained problems, as the aforementioned equivalence fails for inequality constraints and more general $h$. Moreover, their algorithm requires the smoothness of the objective function over the whole space, and they do not allow any constraints other than the linear equality ones. Hence their method cannot handle a general non-smooth $h$ by introducing an auxiliary variable and solving an equality constrained problem, because the new term violates both the global smoothness and the joint strong convexity assumptions in their paper. It is worth mentioning that the lower bound in salim2022optimal is indeed a valid lower bound for our problem. However, since the construction of our lower bound for the general convex case requires the lower bound for the strongly convex case as an intermediate step, we provide a self-contained lower bound for the strongly convex case in this paper. The hard instance and the structure of the zero chain in our example are both different from those in salim2022optimal .
The last related work that we would like to discuss is sun2019distributed , where an optimal first-order algorithm for non-convex distributed optimization is proposed. Like salim2022optimal , their results require the global smoothness of the objective function and rely on Chebyshev acceleration, and are hence hard to generalize to general $h$. A lower bound is also proposed in sun2019distributed for non-convex distributed optimization. However, it is not clear how their lower bound can be reduced to our case, and both their lower and upper bounds include an unusual dependence on the squared norm of the initial gradient. Therefore, instead of trying to reduce from their results, we construct a new hard instance that lower bounds the complexity of making the proximal gradient mapping small.
Organization. Section 2 provides fundamental background and defines the quantities used to measure the suboptimality of an approximate solution. In Section 3, we discuss the upper bounds for each of the three cases. For the sake of readability, we present an efficient algorithm for the strongly convex case first. Then, we utilize this algorithm to solve the proximal point subproblems of the non-convex case. We also utilize this algorithm to solve the general convex case by adding a small strongly convex perturbation, and obtain a corresponding upper bound. Thus, we follow the order from strongly convex to non-convex to convex in the discussion. Then in Section 4, we present formal definitions of the problem classes and algorithm classes and provide the corresponding lower bounds. Finally, in Section 5, we conduct a detailed analysis of the linear equality constrained problem. In particular, we relax the full rank assumption on the matrix $A$. A direct acceleration for the general convex case, instead of adding strongly convex perturbations, is also presented in the last section.
Notations. We denote the Euclidean inner product by $\langle \cdot, \cdot \rangle$ and the Euclidean norm by $\|\cdot\|$. Let $g$ be a differentiable function of two variables. To denote the partial gradient of $g$ with respect to the first (or second) variable at the point $(x, y)$, we use $\nabla_x g(x,y)$ (or $\nabla_y g(x,y)$). The full gradient at $(x,y)$ is denoted by $\nabla g(x,y)$. Suppose $\mathcal{X}$ is a closed convex set; we use $\mathcal{P}_{\mathcal{X}}$ to represent the projection operator onto $\mathcal{X}$. For any matrix $A$, we use $\mathrm{Range}(A)$ to denote its range space. For any vector $x$, we use $\mathrm{supp}(x)$ to denote the support of $x$, i.e., the set of indices of the nonzero entries of $x$. The indicator function of a set $\mathcal{X}$ is defined as $\iota_{\mathcal{X}}(x) = 0$ if $x \in \mathcal{X}$ and $\iota_{\mathcal{X}}(x) = +\infty$ otherwise. The identity matrix in $\mathbb{R}^{n \times n}$ is denoted by $I_n$. The positive part of a real number $t$ is denoted by $[t]_+ := \max\{t, 0\}$.
2 Preliminaries
2.1 Basic facts and definitions
Lipschitz Smoothness. For a differentiable function $f$, we say it is $L$-smooth if $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y$.
Strong Convexity. For a positive constant $\mu$, we say $f$ is $\mu$-strongly convex if $f - \frac{\mu}{2}\|\cdot\|^2$ is convex, and $f$ is $\mu$-strongly concave if $-f$ is $\mu$-strongly convex.
Conjugate Function. The conjugate function of a function $h$ is defined as $h^*(y) := \sup_x \left\{ \langle x, y \rangle - h(x) \right\}.$
It is well known that $h^*$ is convex. Furthermore, when $h$ is assumed to be strongly convex or smooth, its conjugate function has the following properties hiriart1996convex .
Lemma 1
If $h$ is $\mu$-strongly convex, then its conjugate $h^*$ is $(1/\mu)$-smooth. If $h$ is $L$-smooth, then its conjugate $h^*$ is $(1/L)$-strongly convex.
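As a quick numerical illustration of Lemma 1 (our own sketch, not part of the paper), the conjugate of a strongly convex quadratic is smooth with the reciprocal constant:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Q = M @ M.T + np.eye(4)                  # positive definite Hessian
mu = np.linalg.eigvalsh(Q).min()         # f(x) = 0.5 x^T Q x is mu-strongly convex
# conjugate of this quadratic: f*(y) = 0.5 y^T Q^{-1} y, whose Hessian is Q^{-1}
L_conj = np.linalg.eigvalsh(np.linalg.inv(Q)).max()
```

Since the eigenvalues of $Q^{-1}$ are the reciprocals of those of $Q$, the smoothness constant of the conjugate equals $1/\mu$ exactly.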
Proximal Operator. For a proper closed convex function $h$, the corresponding proximal operator is defined as $\mathrm{prox}_h(v) := \operatorname*{argmin}_u \left\{ h(u) + \tfrac{1}{2}\|u - v\|^2 \right\}.$
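Two standard closed-form instances of the proximal operator, shown as an illustrative sketch (the helper names are ours):

```python
import numpy as np

def prox_l1(v, t):
    """prox of t*||.||_1: coordinatewise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_box(v, lo, hi):
    """prox of the indicator of a box = Euclidean projection onto the box."""
    return np.clip(v, lo, hi)
```

For an indicator function, the prox reduces to the projection onto the set, which is the special case used throughout the linear equality constrained discussion.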
2.2 Suboptimality measure
To study the upper and lower complexity bounds of first-order methods for problem (1), it is essential to define the appropriate suboptimality measures for the strongly convex, convex, and non-convex cases, respectively. In the following sections, we will present these measures one by one.
Strongly convex case. In this case, the optimal solution $x^*$ is unique. We can define the suboptimality measure for a point $x$ as its squared distance to the optimal solution:
$\|x - x^*\|^2. \qquad (5)$
Non-convex case. In the non-convex case, we define the suboptimality measure as
which measures the violation of the first-order optimality condition. It is worth pointing out that this operator is not needed by the algorithm; we only use it as a formal measure and never evaluate it directly in our algorithms.
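For the linear equality constrained special case (2), this measure reduces to the norm of the gradient projected onto the null space of $A$. A minimal numerical sketch (the helper name is ours; $A$ is assumed to have full row rank):

```python
import numpy as np

def projected_grad_norm(A, g):
    """Stationarity residual for min f(x) s.t. Ax = b: the norm of the
    gradient g projected onto the null space of A."""
    # orthogonal projector onto null(A), valid when A has full row rank
    P = np.eye(A.shape[1]) - A.T @ np.linalg.solve(A @ A.T, A)
    return np.linalg.norm(P @ g)
```

At a feasible stationary point the gradient lies in the row space of $A$, so this residual vanishes.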
Convex case. The convex case is a little more complicated. The standard measure based on the objective function gap may not fit our setting, since the objective value can be $+\infty$ when $x$ is infeasible. For example, for linearly constrained problems where $h = \iota_{\{b\}}$, first-order methods typically only guarantee returning solutions with small constraint violation instead of exact constraint satisfaction, so the objective function gap will always be $+\infty$. To handle this issue, we propose the following suboptimality measure
where $h$ is replaced by the surrogate function $h_\rho$ given by
$h_\rho(y) := \min_u \left\{ h(u) + \rho \|u - y\| \right\}. \qquad (6)$
For any $\rho > 0$, $h_\rho$ can be viewed as a Lipschitz approximation of $h$, as the following lemma indicates.
Lemma 2
For any $\rho > 0$, the following hold:
1. $h_\rho \le h$, and $h_\rho \to h$ as $\rho \to \infty$.
2. $h_\rho$ is $\rho$-Lipschitz continuous, and if $h$ is $\ell$-Lipschitz continuous with $\rho \ge \ell$, then $h_\rho = h$.
3. If $h = \iota_{b - \mathcal{K}}$ where $\mathcal{K}$ is a proper cone, we have $h_\rho(y) = \rho \, \mathrm{dist}(y, \, b - \mathcal{K})$.
Proof
Part 1. Taking $u = y$ in (6) yields $h_\rho(y) \le h(y)$, and letting $\rho \to \infty$ forces the minimizing $u$ to converge to $y$, which gives $h_\rho \to h$ by the closedness of $h$.
Part 2. For any $y_1, y_2$, let $u$ attain the minimum in the definition of $h_\rho(y_2)$; then $h_\rho(y_1) \le h(u) + \rho\|u - y_1\| \le h_\rho(y_2) + \rho\|y_1 - y_2\|$, so $h_\rho$ is $\rho$-Lipschitz continuous. Moreover, $h(y) \le h(u) + \rho\|u - y\|$ holds for all $u$ if $h$ is $\ell$-Lipschitz continuous with $\ell \le \rho$. Hence, we have $h \le h_\rho$, and combined with Part 1, $h_\rho = h$.
Remark 1
For norm regularized problems (4) where $\|\cdot\|$ is an arbitrary norm, let $\|\cdot\|_*$ be its dual norm; then the conjugate function $h^*$ is the indicator of the unit dual norm ball. As long as $\rho$ is large enough, it holds that
In particular, when , it is sufficient to take .
Remark 2
For the inequality constrained problem (3) with $h = \iota_{b - \mathcal{K}}$, Lemma 2 indicates that $h_\rho$ measures the conic constraint violation. Therefore, we have
Let and be the optimal primal and dual solutions respectively, it is a standard lemma that (see e.g. (zhu2022unified, , Lemma 3)) when , we have
Similarly, for linear equality constrained problems with , we have . Then , if . Therefore, essentially agrees with the widely used suboptimality measure for convex linearly constrained problems; see e.g. xu2021first ; zhu2022unified ; hamedani2021primal . Furthermore, we can also cover the problem formulation with both equality and inequality constraints studied in zhang2022global ; zhang2020proximal , if we let .
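The behavior described in Lemma 2 can be checked numerically by brute force. The sketch below is our own illustration (a 1-D grid stands in for the exact inner minimization), using $h = |\cdot|$, which is 1-Lipschitz:

```python
import numpy as np

def lipschitz_envelope(h, y, rho, grid):
    """Envelope h_rho(y) = min_u { h(u) + rho*|u - y| },
    evaluated by brute force on a 1-D grid (illustration only)."""
    return np.min(h(grid) + rho * np.abs(grid - y))

grid = np.linspace(-10.0, 10.0, 200001)
```

For $\rho \ge 1$ the envelope leaves $h$ unchanged, while for $\rho < 1$ it is capped at the $\rho$-Lipschitz function $\rho |y|$, exactly as Lemma 2 predicts.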
2.3 Nesterov’s accelerated gradient descent
In this paper, we will frequently use Nesterov’s accelerated gradient descent (AGD) as a subroutine, as stated in Algorithm 1. It is the optimal first-order algorithm for smooth strongly convex optimization problems nesterov2018lectures .
The following proposition is a simple corollary of (nesterov2018lectures, , Theorem 2.2.2).
Proposition 1
Assume that is -smooth and -strongly convex. Denote and . Algorithm 1 terminates in
iterations, and the output satisfies .
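For reference, a minimal implementation of the constant-momentum variant of Nesterov's AGD for smooth strongly convex problems (our own sketch; the paper's Algorithm 1 may differ in details such as the stopping rule):

```python
import numpy as np

def agd(grad, x0, L, mu, iters):
    """Constant-momentum Nesterov AGD for an L-smooth, mu-strongly convex f."""
    beta = (np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)
    x = x0.copy()
    y = x0.copy()
    for _ in range(iters):
        x_new = y - grad(y) / L          # gradient step at the extrapolated point
        y = x_new + beta * (x_new - x)   # momentum extrapolation
        x = x_new
    return x
```

On a quadratic with condition number $\kappa_f = 10$, the iterates contract at roughly the accelerated rate $(1 - 1/\sqrt{\kappa_f})$ per step.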
3 Upper bounds
3.1 Strongly convex case: Intuition
Though $\mathrm{prox}_h$ is easy to evaluate, the composition $h(A\cdot)$ may not have an efficiently computable proximal mapping. Therefore, by utilizing the conjugate function of $h$, we decouple the composite structure by reformulating (1) as a convex-concave saddle point problem
$\min_x \max_y \; f(x) + \langle Ax, y \rangle - h^*(y). \qquad (7)$
Switching the order of $\min$ and $\max$ and minimizing with respect to $x$, we obtain the dual problem of (1):
$\max_y \; -f^*(-A^\top y) - h^*(y). \qquad (8)$
The following lemma illustrates that the dual problem is strongly concave.
Lemma 3
The dual objective in (8) is strongly concave with modulus $\sigma_{\min}^2 / L$, where $\sigma_{\min}$ is the minimal singular value of $A$.
Proof
Note that $f$ is $L$-smooth and $\mu$-strongly convex, so Lemma 1 indicates that $f^*$ is $(1/\mu)$-smooth and $(1/L)$-strongly convex. Denote $\phi(y) := f^*(-A^\top y)$; then for any $y_1, y_2$, we have
$\langle \nabla\phi(y_1) - \nabla\phi(y_2), \, y_1 - y_2 \rangle \ge \tfrac{1}{L}\|A^\top(y_1 - y_2)\|^2 \ge \tfrac{\sigma_{\min}^2}{L}\|y_1 - y_2\|^2,$
which implies that $\phi$ is $(\sigma_{\min}^2/L)$-strongly convex. Combining this with the fact that $h^*$ is convex, we complete the proof. ∎
One can observe that the Lipschitz smoothness of $f$ is transferred to the strong concavity of the dual objective through the matrix $A$. Therefore, a linear convergence rate can be expected. To exploit this observation, we perform an inexact proximal point algorithm on the dual problem (8):
$y_{k+1} \approx \operatorname*{argmax}_y \left\{ -f^*(-A^\top y) - h^*(y) - \tfrac{\lambda}{2}\|y - y_k\|^2 \right\}, \qquad (9)$
where $y_k$ is the iterate of the $k$-th step. It now remains to solve the subproblem (9) efficiently. By expanding the term $-f^*(-A^\top y)$ through the conjugate function again, we can rewrite (9) as the equivalent saddle point problem
$\max_y \min_x \; f(x) + \langle Ax, y \rangle - h^*(y) - \tfrac{\lambda}{2}\|y - y_k\|^2. \qquad (10)$
Let $\psi_k(x) := f(x) + \max_y \left\{ \langle Ax, y \rangle - h^*(y) - \tfrac{\lambda}{2}\|y - y_k\|^2 \right\}$. Then the dual problem of (9) is given by
$\min_x \; \psi_k(x). \qquad (11)$
The function in (11) is both smooth and strongly convex: this time, the strong convexity induced by the proximal term for the dual variable is transferred to Lipschitz smoothness in the primal variable $x$. With a few more computations, we show that its gradient can be easily evaluated, and hence AGD can be applied to solve (11).
Proposition 2
The gradient can be evaluated with one call of , one call of , one call of , and one call of , respectively.
Proof
With , Danskin’s theorem indicates that
In fact, can be explicitly written as
(12) |
where the last equality comes from Moreau’s decomposition theorem (showalter1997monotone, , Proposition IV.1.8). ∎
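Moreau's decomposition, invoked above, states that $v = \mathrm{prox}_h(v) + \mathrm{prox}_{h^*}(v)$, so the prox of the conjugate comes for free once $\mathrm{prox}_h$ is available. A quick numerical check with $h = \|\cdot\|_1$, whose conjugate is the indicator of the unit $\ell_\infty$ ball (helper names are ours):

```python
import numpy as np

def prox_l1(v, t=1.0):
    """prox of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_l1_conj(v):
    """prox of h* for h = ||.||_1: projection onto the unit l_inf ball."""
    return np.clip(v, -1.0, 1.0)
```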
3.2 Strongly convex case: Analysis
Now, we are ready to give the complete algorithm for strongly convex problems.
Proposition 3
Suppose that is -smooth and -strongly convex, the matrix satisfies and the minimum singular value of is no smaller than . Let be the pair of optimal primal and dual variables, and be the iterate sequence generated by Algorithm 2. Assume that , then we have for any , it holds that
(13) |
where . If we denote , then for any , it holds that
(14) | ||||
(15) |
Proof
Let and . Recall that the output of the inner AGD solver guarantees that . According to the dual update rule (12) and the optimality of , we have
Thus, we can deduce due to the non-expansiveness of the proximal operator. By the definition of , it holds that
which implies that there exists satisfying . Combining the fact that is -strongly concave and , we have
Hence, for any constant , we have
Picking and utilizing the assumption that yields that
where we denote . Therefore, by recursively applying the inequality above, we have
By our choice of , it holds
Putting these pieces together and plugging in the definition of and , we get (13).
Recall the definition of given in (11). We can rewrite it as
where is defined by
Note that is the conjugate function of , and hence it follows from Lemma 1 that is -smooth. By definition, we know , and hence
where the first inequality follows from the -strong-convexity of , and the last inequality holds because is -smooth. Combining the fact that , we get (14).
Similarly, by the optimality of , we have is the minimum of
Repeating the argument above yields (15). ∎
Theorem 3.1
One comment here is that although we assume knowledge of an upper bound on the distance from the initial point to the optimal solution, this bound only appears in the logarithmic terms. Hence one can use a very loose upper estimate of the distance in the algorithm without deteriorating the performance.
Proof
We first consider the case . By combining (14) and (13), we have
where the last inequality holds because . Therefore, by Proposition 1 and our choice of , the number of inner iterations can be upper bounded by
where the last inequality follows from plugging in the definitions of , and . The case follows similarly. ∎
Corollary 1
Proof
According to Theorem 3.1, if we let , then the complexity of each inner loop is , and the complexity of the outer loop is . Therefore, the overall complexity is . ∎
3.3 Non-convex case
For non-convex problems, our algorithm is presented in Algorithm 3, which employs the inexact proximal point algorithm in the outer iterations while solving the strongly convex subproblems via Algorithm 2.
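The outer structure of this scheme can be sketched as follows. This is a toy version of ours with plain gradient descent as the inner solver in place of Algorithm 2, run here on a 1-D test function:

```python
import numpy as np

def inexact_ppa(grad_f, x0, L, outer_iters, inner_iters=200):
    """Inexact proximal point outer loop for a (possibly non-convex) L-smooth f:
    each outer step approximately minimizes f(y) + L*(y - x_k)^2, which is
    L-strongly convex. Plain gradient descent plays the role of the inner
    solver here (the paper uses an accelerated method instead)."""
    x = x0
    for _ in range(outer_iters):
        y = x
        for _ in range(inner_iters):
            g = grad_f(y) + 2.0 * L * (y - x)  # gradient of the prox subproblem
            y = y - g / (3.0 * L)              # subproblem smoothness <= L + 2L
        x = y
    return x
```

Starting from $x_0 = 2$ on $f(x) = \cos x$ (which is 1-smooth), the outer iterates drift monotonically to the stationary point $\pi$.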
Under a suitable sequence of , we provide the convergence rate of Algorithm 3 in the following theorem.
Theorem 3.2
Suppose that $f$ is $L$-smooth and the condition number of $A$ is at most $\kappa$. Assume that , and let be the iterate sequence generated by Algorithm 3. Then we have
In particular, suppose that the initial point and for some , then taking yields
(17) |
Proof
By the definition of , on the one hand, we have , which yields
(18) |
On the other hand, it holds
Therefore, we have
where the last inequality holds because the proximal operator is non-expansive and is -smooth. Combining with (18), we obtain
Summing up the above inequality over $k$,
Since , it holds that
Furthermore, under the assumption of , it holds and accordingly . This completes the proof. ∎
In Theorem 3.2, we present the first inequality by incorporating the definition of specifically for the scenario where . This is necessary because in such cases can be infinite, as exemplified by the case where $h$ is an indicator function. Consequently, it becomes impossible to find a finite that satisfies . By introducing the definition of , we ensure the existence of a finite quantity and thus establish a well-defined result.
Corollary 2
Proof
By Theorem 3.2, to reach the expected precision, we need outer iterations. For each , the function is -strongly-convex and -smooth, and hence its condition number is . Hence, the number of gradient evaluations in the -th inner iteration with Algorithm 2 is by Corollary 1. Combining the complexities of inner and outer loops, we obtain the overall complexity. For the case , the complexity can be derived similarly.∎
3.4 Convex case
For any given and , we construct the following auxiliary problem:
(19) |
The smooth part is strongly convex and hence we can apply Algorithm 2 to solve the problem. The following corollary illustrates that the approximate solution of (19) is also an approximate solution of the original convex problem and the overall complexity is also optimal.
Corollary 3
Proof
Denote the exact solution of (19) as . We apply Algorithm 2 to (19) and compute a point such that , where will be specified later in the proof.
By the optimality of , it holds that
In particular, we have , and by the fact that , we know .
On the other hand, we have
where the last inequality follows from . Further, since is -Lipschitz continuous and , it holds
Denote . Combining the above two inequalities yields
Therefore, we can set
to ensure that . Notice that the function is -smooth and -strongly convex. Therefore, according to Corollary 1, the required number of gradient evaluations is . ∎
Note that specific forms of the surrogate were given in Lemma 2 and the subsequent remarks. In the following, we give the complexity results for these specific problems.
Corollary 4
For the conic inequality constrained problem (3) and any fixed , in order to find an approximate solution satisfying
the required number of gradient evaluations is .
Corollary 4 implies that for conic inequality constrained convex problems (including linearly constrained convex problems), the constraint can be fulfilled to arbitrary accuracy without affecting the order of the complexity (up to log factor).
Corollary 5
When $h$ is Lipschitz continuous (e.g., the norm regularized problem (4)), in order to find an approximate solution satisfying , the required number of gradient evaluations is .
4 Lower bounds
4.1 Problem classes and algorithm class
In this section, we aim to construct three hard instances for the strongly convex, convex, and non-convex cases, respectively. Let us first formally define the problem classes and the linear span first-order algorithm class. For simplicity of presentation, we construct the hard instances with and .
Strongly convex problem class. For positive constants , , the problem class includes problems in which is -smooth and -strongly convex, , and the condition number of is upper bounded by .
Non-convex problem class. For positive constants and , the problem class includes problems where is -smooth, and the condition number of is upper bounded by .
Convex problem class. For positive constants , the problem class includes problems in which is -smooth, , and the condition number of is upper bounded by .
For the above three problem classes, we restrict our discussion to first-order linear span algorithms. The results can be extended to the general first-order algorithms without linear span structure by the orthogonal invariance trick proposed in carmon2020lower .
First-order linear span algorithms. The iterate sequence is generated such that . These subspaces are generated by starting with , and
If we assume that , then for the linear equality constrained problem (2), and the algorithm class degenerates into
We can further assume, without loss of generality, that ; otherwise, we can consider the shifted problem .
Note that a first-order linear span algorithm is not required to use the current gradient in each iteration. Instead, it can use any combination of points from the historical search space. This makes the algorithm class general enough to cover diverse iteration schemes. To give some specific examples, we present the following single-loop and double-loop algorithms covered by the considered algorithm class.
Example 1 (Single-loop algorithms)
Consider problem (2) with . The Chambolle-Pock method chambolle2011first , the OGDA method mokhtari2020convergence and the linearized ALM xu2021first update their iterates by the following rules:
(Chambolle-Pock) |
(OGDA) |
(Linearized ALM) |
where is a penalty factor.
For problem class , a unified analysis of the above three methods was provided in zhu2022unified , and the corresponding complexity bound is achieved.
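For concreteness, here is a small sketch of Chambolle-Pock on an equality constrained instance of (2). This is our own toy example: we take $f(x) = \frac{1}{2}\|x - c\|^2$ so that its prox is closed-form, and $h = \iota_{\{b\}}$, whose conjugate prox is absorbed into the dual step:

```python
import numpy as np

def chambolle_pock_eq(A, b, c, tau, sigma, iters):
    """Chambolle-Pock for min 0.5*||x - c||^2 s.t. Ax = b.
    Requires tau * sigma * ||A||^2 < 1."""
    m, n = A.shape
    x = np.zeros(n); y = np.zeros(m)
    for _ in range(iters):
        x_old = x
        x = (x - tau * (A.T @ y) + tau * c) / (1.0 + tau)  # prox of tau*f
        x_bar = 2.0 * x - x_old                            # extrapolation
        y = y + sigma * (A @ x_bar - b)                    # dual ascent on Ax = b
    return x
```

On the projection problem below, the exact solution is the projection of $c$ onto the affine set $\{x : Ax = b\}$.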
Example 2 (Double-loop algorithm xu2021iteration )
Consider problem (2) with . Let be the augmented Lagrangian function. ALM generates iterates by
where the subproblem is solved by an inner loop of Nesterov’s AGD method. An complexity for convex problems and an complexity for strongly convex problems are derived in xu2021iteration .
Remark 3
It can be checked that all three algorithms proposed in the upper bound section belong to the defined first-order linear span algorithm class for general $h$.
4.2 The construction of hard instance
In this section, we construct the hard instances for any first-order linear span algorithms. Specifically, we consider the linear equality constrained problem (2) with . For positive integers and , we define the following problem
(20) |
where and is the vector that stacks together in order, the component function is a smooth function to be determined later. To ensure that satisfies the assumptions of different problem classes, we will construct various formulations of in the strongly convex, convex, and non-convex cases, respectively. Additionally, we require to satisfy the following assumption.
Assumption 4.1
For any , the following hold:
1. If , then and .
2. If , then and .
Lemma 4
For matrix defined in (21), its condition number satisfies .
Proof
Note that is a block tridiagonal Toeplitz matrix whose eigenvalues admit a closed form (see noschese2013tridiagonal ). Accordingly, the condition number of satisfies
where the last inequality is due to , . Consequently, . ∎
Next, we demonstrate the propagation of non-zero entries in this example. For first-order linear span algorithm with , we have
It implies that new non-zero entries are introduced either through , or through the action of on . As is a block tridiagonal matrix, each action of enables entries in to "communicate" with their neighboring vectors, thereby propagating the non-zero entries.
Figure 1 illustrates how non-zero entries propagate. Assume that the initial point is for all and . In the first iteration, we use Assumption 4.1 to observe that and , so it is only possible to have . In the second iteration, does not introduce any new non-zero entries, but the action of on causes to receive a non-zero entry from . In the third iteration, we have , which allows to become non-zero. Additionally, propagates the non-zero entry in to . By repeating the above propagation mechanism, we can see that by the th iteration, both and become nonzero. In the th iteration, becomes nonzero through . We can consider iterations to (which consist of iterations) as one complete round of iterations. By repeating this process, each round of iteration can convert up to elements to nonzero. After rounds of iteration, specifically at the th iteration, and become possibly nonzero.
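The propagation mechanism can be mimicked with a toy computation: for a block "chain" matrix, each multiplication by $A^\top A$ activates at most one new neighboring block. This is our own illustration, not the paper's exact hard instance:

```python
import numpy as np

m, d = 6, 4                        # 6 blocks, each of dimension d = 4
# block chain matrix: identity plus one superdiagonal of identity blocks
A = np.kron(np.eye(m) + np.eye(m, k=1), np.eye(d))

def active_blocks(w):
    """Indices of blocks of w that contain a nonzero entry."""
    return {i for i in range(m) if np.any(w[i*d:(i+1)*d] != 0)}

def reach(k):
    """Active blocks after k multiplications by A^T A, starting from block 0."""
    w = np.zeros(m * d); w[0] = 1.0
    for _ in range(k):
        w = A.T @ (A @ w)
    return active_blocks(w)
```

Since $A^\top A$ is block tridiagonal, the active set grows by exactly one block per multiplication, which is the mechanism behind the extra factor in the lower bounds.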
Based on the above procedure, it is natural to obtain the following Lemma.
Lemma 5
For with , we have
where the pair represents the entry . Therefore, for any , letting , we have for any satisfying .
It is well known that for an unconstrained smooth problem, the complexity lower bound is $\Omega(\sqrt{L/\mu}\log(1/\epsilon))$ in the strongly convex case nesterov2018lectures , $\Omega(\sqrt{LR^2/\epsilon})$ in the convex case nesterov2018lectures , and $\Omega(\Delta L \epsilon^{-2})$ in the non-convex case carmon2020lower . These papers construct hard instances with the zero-chain property, where at most one new non-zero entry is added to the decision variable at each iteration (see Chapter 2 of nesterov2018lectures ). In contrast, Lemma 5 indicates that in order to add a non-zero entry to each block, it is necessary to take at least $m$ iterations, that is, $\Omega(\kappa)$ iterations by Lemma 4. Therefore, our complexity lower bounds are multiplied by an additional factor of $\kappa$ on top of the lower bounds for unconstrained problems. This intuitively yields our results listed in Table 1.
4.3 Strongly convex case
Let us construct our hard problem instance based on formulation (20). For any given positive parameters and positive integers , we define function as
(23) |
where
The construction is based on Nesterov’s well-known “chain-like” quadratic function nesterov2018lectures . In the following, we give some properties of the constructed problem instance.
Lemma 6
Proof
Part 1: Let us fix the vectors with , and define . If we take and , it holds
On the one hand, due to the Cauchy-Schwarz inequality, we have . On the other hand,
where the last inequality follows from . Therefore, is convex and -smooth, hence is -strongly convex and -smooth, which implies is also -strongly convex and -smooth.
Part 2: For any satisfying the constraint in (20), we have and . Thus, we only need to verify that is the (unique) minimizer of the function . By the optimality condition , we have
where we denote . Note that is the smallest root of the quadratic equation . By a direct calculation, we can check that satisfies the equations, and hence is the minimizer of the function . Lastly, it holds
where follows from , holds because and .∎
Now we are ready to give our lower bound result for the strongly convex case.
Theorem 4.2
Proof
By property (1), we know is -smooth and -strongly convex. The condition number of is not greater than , and can be suitably chosen so that . Thus the instance we construct belongs to the problem class . According to Lemma 5 and our choice of , we have for any and . Therefore,
where the last inequality comes from Property 2. Notice that and . Substituting them into the above inequality yields
(26) |
Plugging in the definition and the fact that completes the proof. ∎
Corollary 6
For any first-order algorithm in , parameters and , there exists a problem instance in such that at least
iterations are required in order to find an iterate satisfying .
4.4 Non-convex case
For the hard instance we present in (20), the definition of the suboptimality measure becomes
where refers to the feasible region of problem (20), i.e., . For a given positive integer , we define the function as
(27) |
where the functions and are defined as
The function in formulation (20) is a scaled version of , and its formal definition will be given later. Let us discuss first. Define , and one can observe that coincides with the hard instance constructed in carmon2020lower . We give some useful properties of .
Lemma 7
The function has the following properties.
-
1.
If , then .
-
2.
.
-
3.
is -smooth with being a universal constant that is independent of the problem dimension and other constants.
Proof
Part 1 comes from carmon2020lower .
To prove Part 3, fix with , and define the function as the directional projection of along , i.e., . Taking and , we have
By simple derivations, we can obtain and . It follows
Therefore, it holds
Since and , we obtain
which completes the proof. ∎
For given positive parameters , we construct the following scaled function based on ,
(28) |
where can be adjusted to fulfill the condition . Similarly, we define . By simple derivations, we can generalize Lemma 7 to the case of .
Corollary 7
The function has the following properties.
-
1.
.
-
2.
If , then .
-
3.
is -smooth.
Next, we give a key lemma that establishes the relationship between and .
Lemma 8
For any , let . Suppose that is -smooth, then we have
Proof
Denote . On the one hand, it holds that
Therefore, we have
(29) |
where the last equality holds because . On the other hand, recall that , by chain rule we have
(30) |
For the first term on the right hand side, we have
where the second inequality is due to the -smoothness of . A similar derivation applies to the second term on the right hand side of (30). Therefore, it holds
where and come from the Cauchy–Schwarz inequality, and utilizes equality (29). ∎
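The chain-rule computation in (30) rests on the scaling identity for functions of the form g(y) = λ·h(y/β): the gradient satisfies ∇g(y) = (λ/β)∇h(y/β), so an L-smooth h yields a (λ/β²)L-smooth g. The following hedged sketch verifies the identity against numerical differentiation with a stand-in h (the hard instance itself is not reproduced).

```python
import numpy as np

# Hedged check of the chain-rule scaling: for g(y) = lam * h(y / beta),
# grad g(y) = (lam / beta) * grad h(y / beta). The function h below is a
# smooth stand-in, not the paper's hard instance.
lam, beta = 3.0, 2.0
h = lambda x: np.sum(np.log(1 + x**2))
grad_h = lambda x: 2 * x / (1 + x**2)

g = lambda y: lam * h(y / beta)

def num_grad(f, y, eps=1e-6):
    """Central-difference gradient of f at y."""
    e = np.eye(len(y))
    return np.array([(f(y + eps * e[i]) - f(y - eps * e[i])) / (2 * eps)
                     for i in range(len(y))])

y = np.array([0.3, -1.2, 0.7])
assert np.allclose(num_grad(g, y), (lam / beta) * grad_h(y / beta), atol=1e-5)
```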
We can now state a complexity lower bound for finding an approximate solution of non-convex problems with first-order linear span algorithms.
Theorem 4.3
Proof
By property (3) in Corollary 7, we know is -smooth. According to the definition of , it holds . As proved in Theorem 4.2, the condition number of is not greater than . Accordingly, the problem instance belongs to the class . Since , the last element of all vectors is zero, that is, . This implies . Utilizing Lemma 8 and property (2) in Corollary 7, we obtain
By definition, we have , and this gives the desired result. ∎
Corollary 8
For any first-order algorithm in and where is defined in Theorem 4.3, there exists a problem instance in such that at least
iterations are required in order to find an iterate satisfying .
4.5 Convex case
For the linear equality constrained problem, the definition of suboptimality becomes
Recall that for , implies the standard optimality measure .
Note that in Section 4.3, we consider the strongly convex problem class with . In this section, we demonstrate how the complexity lower bound presented in Theorem 4.2 can be reduced to the convex problem class with .
Theorem 4.4
Proof
We can suitably choose so that . Then, under our choice of parameters, we have . Note that , and hence . Therefore, using the fact that is -strongly convex and , we have
where the last two inequalities are due to (26). According to the choice of and , we have , which completes the proof. ∎
Corollary 9
For any first-order algorithm in and any where is defined in Theorem 4.4, there exists a problem instance in such that at least
iterations are required in order to find an iterate satisfying with .
5 Linear equality constrained problem
In this section, we focus on a special case of (1) when , corresponding to the linear equality constrained problem
(31) |
In Section 5.1, it will be demonstrated that the requirement of full row rank for the matrix can be removed. In Section 5.2, we will show that the optimal iteration complexity can be achieved by directly applying APPA on the convex problem.
5.1 Refined result of linear equality constrained problem
First, we consider the strongly convex case. Similar to the derivation in Section 3.1, problem (31) is equivalent to
Due to the strong duality, we have . Accordingly, the corresponding dual problem can be written as
(32) |
Without the assumption that has full row rank, the proof of Lemma 3 no longer holds and is not necessarily strongly convex. Actually, we have the following property.
Lemma 9
Let be the minimal nonzero singular value of . It holds that is -strongly concave on the subspace .
Proof
Noting that is -smooth and -strongly convex, it is easy to derive that is -smooth and -strongly convex. Then for any , we have
where the last inequality holds because . ∎
According to the update rule of Algorithm 2, if we set , then we have
Due to the feasibility of the constraint , we have . Hence the iterate sequence always stays in . Combining this with Lemma 9, we can treat as a strongly concave function during the iterations. Then the derivation in Section 3.1 still holds after replacing with and with . Consequently, the derived complexity upper bound is .
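The subspace argument above can be illustrated numerically. The following hedged sketch replaces Algorithm 2 by plain dual gradient ascent on a toy quadratic problem with a rank-deficient matrix; since the right-hand side is feasible and the multiplier starts at zero, every dual iterate remains in the range of the matrix, which is where the strong concavity of Lemma 9 applies.

```python
import numpy as np

# Hedged sketch: with a feasible constraint (b in range(A)) and lambda_0 = 0,
# dual updates of the form lambda <- lambda + alpha * (A x - b) never leave
# range(A). Plain dual ascent on min 0.5*||x||^2 s.t. A x = b is used as a
# stand-in for Algorithm 2, whose exact update is not reproduced here.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
A[2] = A[0] + A[1]               # make A rank-deficient (rank 2)
b = A @ rng.standard_normal(6)   # feasible right-hand side

lam = np.zeros(3)
alpha = 0.02
for _ in range(50):
    x = -A.T @ lam               # argmin_x 0.5*||x||^2 + lam^T (A x - b)
    lam = lam + alpha * (A @ x - b)

# lam lies in range(A): its projection onto the orthogonal complement vanishes
U, s, _ = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))
P_perp = np.eye(3) - U[:, :r] @ U[:, :r].T
assert np.linalg.norm(P_perp @ lam) <= 1e-8 * (1 + np.linalg.norm(lam))
```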
For the convex problem, Algorithm 2 is utilized to solve the strongly convex auxiliary problem (19), hence the upper bound can also be generalized to . For the nonconvex problem, Algorithm 2 is utilized to solve the subproblem (16). Therefore, the complexity upper bound can be similarly generalized to .
5.2 Direct acceleration for convex problem
Recall that in Section 3.4, we derive the upper bound for the convex problem by constructing a strongly convex auxiliary problem. In this section, we propose an optimal algorithm, presented in Algorithm 4, that solves the original convex problem directly. The algorithm performs an inexact accelerated proximal point method on while keeping the constraint intact in the outer loop. In the inner loop, it uses Algorithm 2 to iteratively solve the subproblem until it meets the first-order suboptimality conditions.
(33) |
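As a hedged sketch of the outer loop only (the inner solver, extrapolation rule, and error tolerances of Algorithm 4 are not reproduced; the proximal subproblem is solved exactly, and the FISTA-style momentum of beck2009fast is assumed), an accelerated proximal point iteration on an unconstrained quadratic looks as follows.

```python
import numpy as np

# Hedged sketch of an accelerated proximal point outer loop. The proximal
# subproblem is solved exactly here; Algorithm 4 instead solves it inexactly
# with Algorithm 2 under a linear equality constraint.
rng = np.random.default_rng(1)
G = rng.standard_normal((5, 5))
Q = G @ G.T + np.eye(5)                 # strongly convex quadratic objective
f = lambda x: 0.5 * x @ Q @ x           # minimum value 0 at x = 0
gamma = 1.0                             # proximal stepsize

def prox(y):
    # argmin_x f(x) + 1/(2*gamma) * ||x - y||^2, exact for a quadratic
    return np.linalg.solve(Q + np.eye(5) / gamma, y / gamma)

x_prev = x = rng.standard_normal(5)
t_prev = 1.0
for _ in range(100):
    t = (1 + np.sqrt(1 + 4 * t_prev**2)) / 2
    y = x + ((t_prev - 1) / t) * (x - x_prev)   # momentum extrapolation
    x_prev, x, t_prev = x, prox(y), t

# conservative check via the O(1/k^2) accelerated proximal point guarantee
assert f(x) < 1e-2
```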
The convergence proof is inspired by beck2009fast and jiang2012inexact . Different from several works that study the inexact accelerated PPA for unconstrained convex problems, we have an additional linear equality constraint here. Since projection onto the linear constraint is not allowed, we need to incorporate the violation of the constraint into the analysis, which makes the convergence proof quite different. For simplicity, we define the following notations:
Due to the KKT condition, we know and . It follows that , which corresponds to the Bregman divergence associated with for the points and . In other words, measures the distance between and under the Bregman divergence and can serve as a suboptimality measure to some extent. In the following analysis, we will first give an upper bound of (or ) and eventually derive an upper bound for the objective function gap and the constraint violation.
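The role of the Bregman divergence as a suboptimality measure can be made concrete. The following hedged sketch uses a stand-in convex function (not the paper's objective) and checks the two defining properties: nonnegativity, and vanishing at the reference point.

```python
import numpy as np

# Hedged illustration of the Bregman-divergence suboptimality measure:
# D_f(x, x_star) = f(x) - f(x_star) - <grad f(x_star), x - x_star> >= 0
# for convex f, with value 0 at x = x_star. f below is a convex stand-in.
f = lambda x: np.sum(x**4) + 0.5 * np.sum(x**2)
grad_f = lambda x: 4 * x**3 + x

def bregman(x, x_star):
    return f(x) - f(x_star) - grad_f(x_star) @ (x - x_star)

x_star = np.array([0.5, -1.0])
for x in [np.array([0.5, -1.0]), np.array([1.0, 0.0]), np.array([-2.0, 3.0])]:
    assert bregman(x, x_star) >= 0      # nonnegativity from convexity
assert bregman(x_star, x_star) == 0     # vanishes at the reference point
```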
According to the update rule and , it is easy to obtain the following lemma, which will be used frequently in the subsequent proofs.
Lemma 10
The sequence generated by Algorithm 4 satisfies for any .
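The exact update rule for the sequence is elided above; assuming the standard FISTA sequence t_1 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2 of beck2009fast , the classical growth bound t_k >= (k + 1) / 2 can be verified numerically.

```python
import math

# Hedged assumption: Algorithm 4 uses the standard FISTA momentum sequence.
# We check the classical growth bound t_k >= (k + 1) / 2 along the recursion.
t = 1.0
for k in range(1, 200):
    assert t >= (k + 1) / 2
    t = (1 + math.sqrt(1 + 4 * t * t)) / 2
```

The bound follows by induction: t_{k+1} >= (1 + 2 t_k) / 2 = 1/2 + t_k, so each step increases t by at least 1/2.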
In the following lemma, we give the one-step estimation of .
Lemma 11
For any , it holds
(34) |
Proof
For each and any , we have
(35) |
Note that . We apply (35) with and to get
(36) |
where we utilize the fact that . Similarly, letting and in (35), we obtain
(37) |
Multiply (36) by and add it to (37). Then, multiplying on both sides of the obtained inequality and noticing the relation , we have
For the first two terms on the right hand side of the above inequality, we use the usual Pythagoras relation , then substitute into it. After rearranging, we obtain
Combining the fact yields the conclusion. ∎
Now, having the inequality (34), we can derive an upper bound of .
Lemma 12
For any , it holds
(38) |
Proof
By applying (35) with and , we have
where we utilize the usual Pythagoras relation . Noting that , and , we have
Since , the above inequality implies
(39) |
Let . Utilizing (34) repeatedly and combining (39), we obtain
(40) |
Since , we have , which implies
By some simple derivations, we get
(41) |
From the inequality (39), we know that , which implies . Accordingly,
(42) |
Applying (41) repeatedly and combining (42) yields
where the second inequality holds because and by the definition of . The above inequality implies
(43) |
and hence . Substituting it into (40), we obtain
which completes the proof. ∎
Note that the right hand side of (38) depends on , which is still unclear. We further estimate this term in the following lemma.
Lemma 13
Assume that , and for any , we have
(44) | ||||
(45) |
Proof
By the update rule , it holds
Thus, we have due to the fact that and . Combining the KKT condition with the definition of , we obtain
Accordingly, it follows
(46) |
where the inequality uses the smoothness of and . Since , the above inequality actually implies
It follows that for any ,
(47) |
where the last inequality uses the .
According to Lemma 12 and Lemma 13, it remains to estimate in order to obtain the upper bound of . In the discussion that follows, we prove that can be bounded by a linearly increasing sequence. Utilizing this property, we give the requirements on and , and obtain the following result.
Theorem 5.1
Proof
We can check that the assumptions in Lemma 13 are satisfied: , and for any . We also have and .
First, we discuss the upper bound of . By definition,
where the last inequality holds because . By (40), it holds . It follows
which implies
Substituting (43) into the above inequality and combining (45), we get
For simplicity of notation, we denote ,
,
,
and the above inequality can be rewritten into
which implies
By simple derivations, we have
(48) |
On the one hand, we observe that the upper bound of depends on and , which further depend on . On the other hand, from (45), we know can be bounded by . Given this relationship, we can establish the upper bounds of both and by induction. It is easy to derive that , and . Let , . Let be the largest root of (51) and
(49) |
Note that and only depend on the constants and . We aim to prove and .
First, we prove the case when . Note that . By (40), we have
Combining with (42) yields
(50) |
Putting (50) and (44) together and combining yield a quartic inequality in :
(51) |
Recall the definition of and . We get . Thus, it holds
where the last inequality holds because .
Suppose that there exists such that and for any . Then we have and . Hence and (48) implies
where the third inequality is due to and the last inequality holds because is greater than the largest root of the equation . Then we can obtain due to (45) and . Consequently, by induction we conclude that for any . It follows . Combining the fact that and inequality (38) yields
which completes the proof. ∎
Corollary 10
Proof
In Theorem 5.1, we have proved that the outer loop complexity is . Let . The KKT conditions are and . By the definition of and , we have
It follows and . Let ; then the subroutine only needs to output a pair such that and , so that the required subproblem error tolerance is satisfied. The condition number of the objective function of the subproblem (33) is . Since is polynomial in and the other parameters, the number of gradient evaluations in each inner loop is by Corollary 1. Consequently, the overall complexity is . ∎
Remark 4
Note that Algorithm 4 outputs an approximate solution of the last subproblem, which satisfies . If we further require that , the overall complexity remains unchanged up to logarithmic factors since the inner loop converges linearly. Therefore, in order to obtain an approximate solution satisfying and , the complexity upper bound is still . In this case, we can conclude that because . Therefore, the above result actually implies an complexity upper bound of in order to find an approximate solution satisfying and .
6 Conclusions
In this work, we analyze the lower and upper complexity bounds of composite optimization problems in the strongly convex, convex, and non-convex settings. Different from most previous studies, we specifically consider problem classes with a given condition number . Our results demonstrate that the presented complexities are optimal up to logarithmic factors. This study provides the first optimal algorithms for the convex and non-convex cases, as well as the first set of lower bounds for all three cases. In future work, it remains interesting to investigate how algorithms can be designed to further tighten the logarithmic factors in the complexity upper bounds.
Funding This work was supported by the National Natural Science Foundation of China under grant number 11831002.
Data Availability
Declarations
Conflict of interest The authors have no relevant financial interest to disclose.
References
- (1) Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
- (2) Bertsekas, D.: Dynamic programming and optimal control: Volume I, vol. 1. Athena Scientific (2012)
- (3) Cai, X., Han, D., Yuan, X.: On the convergence of the direct extension of ADMM for three-block separable convex minimization models with one strongly convex function. Computational Optimization and Applications 66, 39–73 (2017)
- (4) Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Mathematical Programming 184(1), 71–120 (2020)
- (5) Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40, 120–145 (2011)
- (6) Chang, T.H., Hong, M., Liao, W.C., Wang, X.: Asynchronous distributed ADMM for large-scale optimization—part I: Algorithm and convergence analysis. IEEE Transactions on Signal Processing 64(12), 3118–3130 (2016)
- (7) Feijer, D., Paganini, F.: Stability of primal–dual gradient dynamics and applications to network optimization. Automatica 46(12), 1974–1981 (2010)
- (8) Hamedani, E.Y., Aybat, N.S.: A primal-dual algorithm with line search for general convex-concave saddle point problems. SIAM Journal on Optimization 31(2), 1299–1329 (2021)
- (9) Hiriart-Urruty, J., Lemaréchal, C.: Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg (1996). URL https://books.google.com.sg/books?id=aSizI0n6tnsC
- (10) Hong, M., Hajinezhad, D., Zhao, M.M.: Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In: International Conference on Machine Learning, pp. 1529–1538. PMLR (2017)
- (11) Jiang, K., Sun, D., Toh, K.C.: An inexact accelerated proximal gradient method for large scale linearly constrained convex SDP. SIAM Journal on Optimization 22(3), 1042–1064 (2012)
- (12) Kong, W., Melo, J.G., Monteiro, R.D.: Complexity of a quadratic penalty accelerated inexact proximal point method for solving linearly constrained nonconvex composite programs. SIAM Journal on Optimization 29(4), 2566–2593 (2019)
- (13) Kong, W., Melo, J.G., Monteiro, R.D.: Iteration complexity of an inner accelerated inexact proximal augmented Lagrangian method based on the classical Lagrangian function. SIAM Journal on Optimization 33(1), 181–210 (2023)
- (14) Lin, T., Ma, S., Zhang, S.: On the global linear convergence of the ADMM with multiblock variables. SIAM Journal on Optimization 25(3), 1478–1497 (2015)
- (15) Lin, T., Ma, S., Zhang, S.: On the sublinear convergence rate of multi-block ADMM. Journal of the Operations Research Society of China 3, 251–274 (2015)
- (16) Mokhtari, A., Ozdaglar, A.E., Pattathil, S.: Convergence rate of O(1/k) for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems. SIAM Journal on Optimization 30(4), 3230–3251 (2020)
- (17) Nesterov, Y.: Lectures on convex optimization, vol. 137. Springer (2018)
- (18) Noschese, S., Pasquini, L., Reichel, L.: Tridiagonal Toeplitz matrices: properties and novel applications. Numerical Linear Algebra with Applications 20(2), 302–326 (2013)
- (19) Ouyang, Y., Chen, Y., Lan, G., Pasiliao Jr, E.: An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences 8(1), 644–681 (2015)
- (20) Ouyang, Y., Xu, Y.: Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming 185(1-2), 1–35 (2021)
- (21) Salim, A., Condat, L., Kovalev, D., Richtárik, P.: An optimal algorithm for strongly convex minimization under affine constraints. In: International Conference on Artificial Intelligence and Statistics, pp. 4482–4498. PMLR (2022)
- (22) Shi, W., Ling, Q., Wu, G., Yin, W.: Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization 25(2), 944–966 (2015)
- (23) Showalter, R.: Monotone operators in Banach space and nonlinear partial differential equations. Math. Surv. Mono. 49 (1997)
- (24) Song, C., Jiang, Y., Ma, Y.: Breaking the optimal rate for a class of minimax problems. arXiv preprint arXiv:2003.11758 (2020)
- (25) Sun, H., Hong, M.: Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms. IEEE Transactions on Signal Processing 67(22), 5912–5928 (2019)
- (26) Xu, Y.: Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization 27(3), 1459–1484 (2017)
- (27) Xu, Y.: First-order methods for constrained convex programming based on linearized augmented Lagrangian function. INFORMS Journal on Optimization 3(1), 89–117 (2021)
- (28) Xu, Y.: Iteration complexity of inexact augmented lagrangian methods for constrained convex programming. Mathematical Programming 185, 199–244 (2021)
- (29) Xu, Y.: First-order methods for problems with functional constraints can have almost the same convergence rate as for unconstrained problems. SIAM Journal on Optimization 32(3), 1759–1790 (2022)
- (30) Yang, J., Zhang, Y.: Alternating direction algorithms for $\ell_1$-problems in compressive sensing. SIAM Journal on Scientific Computing 33(1), 250–278 (2011)
- (31) Zhang, J., Luo, Z.Q.: A proximal alternating direction method of multiplier for linearly constrained nonconvex minimization. SIAM Journal on Optimization 30(3), 2272–2302 (2020)
- (32) Zhang, J., Luo, Z.Q.: A global dual error bound and its application to the analysis of linearly constrained nonconvex optimization. SIAM Journal on Optimization 32(3), 2319–2346 (2022)
- (33) Zhu, M., Chan, T.: An efficient primal-dual hybrid gradient algorithm for total variation image restoration. UCLA CAM Report 34, 8–34 (2008)
- (34) Zhu, Z., Chen, F., Zhang, J., Wen, Z.: A unified primal-dual algorithm framework for inequality constrained problems. arXiv preprint arXiv:2208.14196 (2022)