
Zhenyuan Zhu: Beijing International Center for Mathematical Research, Peking University, Beijing, China. Email: [email protected]
Fan Chen: School of Mathematical Science, Peking University, Beijing, China. Email: [email protected]
Junyu Zhang (Corresponding Author): Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore. Email: [email protected]
Zaiwen Wen: International Center for Mathematical Research, Center for Machine Learning Research and College of Engineering, Peking University, Beijing, China. Email: [email protected]

On the Optimal Lower and Upper Complexity Bounds for a Class of Composite Optimization Problems

Zhenyuan Zhu    Fan Chen    Junyu Zhang    Zaiwen Wen
Abstract

We study the optimal lower and upper complexity bounds for finding approximate solutions to the composite problem minxf(x)+h(Axb)\min_{x}\ f(x)+h(Ax-b), where ff is smooth and hh is convex. Given access to the proximal operator of hh, for strongly convex, convex, and nonconvex ff, we design efficient first order algorithms with complexities O~(κAκflog(1/ϵ))\tilde{O}\left(\kappa_{A}\sqrt{\kappa_{f}}\log\left(1/{\epsilon}\right)\right), O~(κALfD/ϵ)\tilde{O}\left(\kappa_{A}\sqrt{L_{f}}D/\sqrt{\epsilon}\right), and O~(κALfΔ/ϵ2)\tilde{O}\left(\kappa_{A}L_{f}\Delta/\epsilon^{2}\right), respectively. Here, κA\kappa_{A} is the condition number of the matrix AA in the composition, LfL_{f} is the smoothness constant of ff, and κf\kappa_{f} is the condition number of ff in the strongly convex case. DD is the initial point distance and Δ\Delta is the initial function value gap. Tight lower complexity bounds for the three cases are also derived and they match the upper bounds up to logarithmic factors, thereby demonstrating the optimality of both the upper and lower bounds proposed in this paper.

Keywords:
composite optimization, first order method, upper bound, lower bound
MSC:
90C25, 90C26, 90C46, 90C60

1 Introduction

In this paper, we consider the composite optimization problem:

minxF(x):=f(x)+h(Axb),\min_{x}\ F(x)\mathrel{\mathop{:}}=f(x)+h(Ax-b), (1)

where f:\mathbb{R}^{n}\mapsto\mathbb{R} is a differentiable function whose gradient is L_{f}-Lipschitz continuous, h:\mathbb{R}^{m}\mapsto\mathbb{R}\cup\{+\infty\} is a proper closed convex function whose proximal operator can be efficiently evaluated, and A\in\mathbb{R}^{m\times n} is a matrix with full row rank whose minimal singular value is \mu_{A}. We denote the condition number of the matrix A by \kappa_{A}\mathrel{\mathop{:}}=\frac{\|A\|}{\mu_{A}}. Problem (1) appears in many practical applications, such as distributed optimization shi2015extra ; chang2016asynchronous ; hong2017prox , signal and image processing yang2011alternating ; zhu2008efficient ; chambolle2011first , network optimization feijer2010stability and optimal control bertsekas2012dynamic . In this paper, we aim to establish tight upper and lower complexity bounds for all three cases of strongly convex, convex, and non-convex f.

When h()h(\cdot) is chosen to be the indicator function h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot), the problem becomes the linear equality constrained problem

minxf(x),s.t.Ax=b.\min_{x}f(x),\quad\,\textrm{s.t.}\,Ax=b. (2)

More generally, when h()h(\cdot) is chosen to be h()=𝕀𝒦()h(\cdot)=\mathbb{I}_{\mathcal{K}}(\cdot) where 𝒦\mathcal{K} is a proper cone, the problem becomes the conic inequality constrained problem

minxf(x),s.t.Axb𝒦.\min_{x}\ f(x),\quad\,\textrm{s.t.}\,Ax-b\in\mathcal{K}. (3)

Alternatively, when h()=h(\cdot)=\left\|\cdot\right\|_{*} is a norm function, the problem becomes the norm regularized problem

minxf(x)+Axb.\min_{x}\ f(x)+\|Ax-b\|_{*}. (4)

The motivation of this paper originates from the study of the complexity of non-convex optimization with linear equality constraints (2), which corresponds to the special case of (1) where h(\cdot)=\mathbb{I}_{\{0\}}(\cdot) and f(x) is non-convex. This problem has received considerable attention in the literature. The authors of kong2019complexity combine the proximal point method and the quadratic penalty method, achieving an O(\epsilon^{-3}) complexity for finding an \epsilon-stationary point. Under the assumption that the domain is bounded, the authors of kong2023iteration develop an algorithm that solves the proximal augmented Lagrangian subproblems by the accelerated composite gradient method, resulting in a complexity of \tilde{O}(\epsilon^{-2.5}).

Several recent works have further improved the complexity to O(\epsilon^{-2}), including hong2017prox ; sun2019distributed ; zhang2020proximal ; zhang2022global . The key factor contributing to this improvement is the incorporation of the minimal nonzero singular value of A in the analysis of the complexity bounds. Specifically, the condition number of A is required to be bounded by some constant. In hong2017prox , a proximal primal-dual algorithm is developed, but its dependence on \kappa_{A} is very large. In sun2019distributed , a novel optimal method for distributed optimization is proposed, yet its dependence on Chebyshev acceleration prevents its generalization to problems with h(\cdot)\neq\mathbb{I}_{\{0\}}(\cdot). The authors of zhang2020proximal focus on the problem with an extra box constraint, but their analysis assumes a complementarity property that is not easy to check in advance. The subsequent work zhang2022global relaxes this assumption, but the dependence on other parameters remains unclear.

Therefore, it is natural to ask what are the complexity upper and lower bounds of first-order methods for the non-convex linearly constrained problem (2). To answer this question, we consider the general form problem (1) and utilize the norm of the proximal gradient mapping Lfx𝐩𝐫𝐨𝐱1Lfh(Ab)(x12Lff(x))L_{f}\left\|x-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x-\frac{1}{2L_{f}}\nabla f(x)\right)\right\| as the convergence measure. When hh reduces to an indicator function h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot) which corresponds to the linear equality constrained problem (2), this error measure reduces to the norm of the projected gradient mapping Lfx𝒫{Ax=b}(x12Lff(x))L_{f}\left\|x-\mathcal{P}_{\{Ax=b\}}\left(x-\frac{1}{2L_{f}}\nabla f(x)\right)\right\|. We rigorously define the first-order linear span algorithm class and construct a hard instance to establish a lower bound of Ω(κALfΔ/ϵ2)\Omega\left(\kappa_{A}L_{f}\Delta/\epsilon^{2}\right) where Δ\Delta quantifies the gap in the objective function at the initial point. To demonstrate the tightness of the lower bound, we design an efficient algorithm based on the idea of accelerated proximal point algorithm (APPA), which achieves the bound of O~(κALfΔ/ϵ2)\tilde{O}\left(\kappa_{A}L_{f}\Delta/\epsilon^{2}\right).

The APPA approach solves a non-convex problem by successively solving a sequence of strongly convex subproblems. Hence, the optimality of the complexity for solving problem (1) with strongly convex ff is also crucial. One straightforward approach is to introduce an additional variable y=Axby=Ax-b and apply ADMM to solve the linear constrained problem where one function in the objective is strongly convex and the other is convex. For example, a sublinear convergence rate is obtained in cai2017convergence and lin2015sublinear . The lower bound is examined by constructing a challenging linear equality constrained problem in ouyang2021lower . Assuming a strongly convex parameter μf\mu_{f}, the derived lower bound is Ω(Aλμfϵ)\Omega\left(\frac{\|A\|\|\lambda^{\star}\|}{\mu_{f}\sqrt{\epsilon}}\right) where λ\lambda^{\star} is the optimal dual variable. Remarkably, this dominant term matches the upper bound provided in xu2021iteration up to logarithmic factors.

If we further incorporate the minimal nonzero singular value of AA into the complexity bound, it is possible to “break” the lower bound established in ouyang2021lower . For example, the ADMM and a set of primal-dual methods are proved to have linear convergence rates (lin2015global, ; zhu2022unified, , etc.). Furthermore, for the linear equality constrained problem (2) with strongly convex ff, the algorithms proposed in zhu2022unified and salim2022optimal achieve complexity of O((κA2+κf)log(1/ϵ))O\left((\kappa_{A}^{2}+\kappa_{f})\log(1/\epsilon)\right) and O(κAκflog(1/ϵ))O\left(\kappa_{A}\sqrt{\kappa_{f}}\log(1/\epsilon)\right), respectively. However, salim2022optimal utilizes Chebyshev iteration as a sub-routine, and is hence restricted to the linear equality constrained problems. An Ω(κAκflog(1/ϵ))\Omega\left(\kappa_{A}\sqrt{\kappa_{f}}\log(1/\epsilon)\right) lower bound is also claimed for the linear equality constrained case in salim2022optimal .

For completeness, we also discuss the case with general convex ff, which has been extensively investigated. The authors of ouyang2015accelerated and xu2017accelerated design algorithms that converge to an ϵ\epsilon-optimal solution with a complexity of O(ϵ1)O(\epsilon^{-1}). zhu2022unified proposes a unified framework that covers several well-known primal-dual algorithms and establishes an ergodic convergence rate of O(ϵ1)O(\epsilon^{-1}). The lower bound is provided in ouyang2021lower and the dominant term is Ω(Axλ/ϵ)\Omega\left({\|A\|\|x^{\star}\|\|\lambda^{\star}\|}/{\epsilon}\right) which matches the upper bounds provided in ouyang2015accelerated . Recent researches also establish complexity results better than O(ϵ1)O(\epsilon^{-1}) under stronger assumptions. On the linear constrained problem with O(1)O(1) constraints, xu2022first utilizes the cutting-plane method as the subroutine of ALM and proves a dimension-dependent complexity of O(1/ϵ)O(1/\sqrt{\epsilon}). For a class of bilinear and smooth minimax problems (which include linear constrained problems), song2020breaking proposes an algorithm that achieves O(1/ϵ)O\left(1/\sqrt{\epsilon}\right) complexity, but its dependence on other constants remains suboptimal. Under our setting, we demonstrate that both lower and upper bounds can achieve Θ~(κADL/ϵ)\tilde{\Theta}(\kappa_{A}D\sqrt{L/\epsilon}) by deducing from the strongly convex case.

Contribution. Our results are listed in Table 1 and our main contributions are summarized as follows.

  • Under the assumption of bounded κA\kappa_{A}, we establish the complexity lower bounds for solving (1) with non-convex, strongly convex, and convex ff, within the first-order linear span algorithm class, by constructing three hard linear equality constrained instances.

  • By exploiting the idea of the (accelerated) proximal point algorithm, we design efficient algorithms to solve problem (1) in all three cases.

  • Under the assumption of bounded κA\kappa_{A}, we prove that the complexities of our algorithms match the lower bounds up to logarithmic factors. Therefore, these complexities are optimal.

  • For the special case of the linear equality constrained problem (2), we prove that the full row rank assumption can be removed. The upper bounds remain unchanged except that \kappa_{A} is replaced by \underline{\kappa}_{A}=\|A\|/\underline{\sigma}_{\min}, with \underline{\sigma}_{\min} being the minimum nonzero singular value of A.

Setting Optimality Measure Complexity (upper bound / lower bound)
Strongly Convex \|x-x^{\star}\| \tilde{O}\left(\kappa_{A}\sqrt{\kappa_{f}}\log\left(1/{\epsilon}\right)\right) / \Omega\left(\kappa_{A}\sqrt{\kappa_{f}}\log\left(1/{\epsilon}\right)\right)
Convex f(x)+h_{\rho}(Ax-b)-\min_{x}F(x) \tilde{O}\left(\kappa_{A}\sqrt{L_{f}}D/\sqrt{\epsilon}\right) / \Omega\left(\kappa_{A}\sqrt{L_{f}}D/\sqrt{\epsilon}\right)
Non-convex L_{f}\left\|x-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x-\frac{1}{2L_{f}}\nabla f(x)\right)\right\| \tilde{O}\left(\kappa_{A}L_{f}\Delta/\epsilon^{2}\right) / \Omega\left(\kappa_{A}L_{f}\Delta/\epsilon^{2}\right)
Table 1: Summary of our results; for each setting, the upper bound is listed first and the lower bound second. x^{\star}: the optimal solution, x_{0}\in\mathrm{dom}F: the initial point, h_{\rho}(\cdot): a surrogate function of h(\cdot) (defined in (6)), \mathbf{prox}: the proximal operator, L_{f}: Lipschitz constant of \nabla f(x), \mu_{f}: strong convexity parameter of f(x), \kappa_{f}=L_{f}/\mu_{f}, \kappa_{A}=\|A\|/\mu_{A}, F(x_{0})-F(x^{*})\leq\Delta, \|x_{0}-x^{*}\|\leq D.

Related works. For this paper, there are several closely related works. The first one that we would like to discuss is song2020breaking . It is designed for a class of smooth bilinear minimax problems, which includes convex linearly constrained problems after proper reformulation. With some extra computation, the results of song2020breaking indicate an O(κA(D+D)Lf/ϵ)O\big{(}\kappa_{A}(D+D^{*})\sqrt{L_{f}/\epsilon}\big{)} complexity, where DD^{*} upper bounds the norm of the optimal dual variables. The additional O(κADLf/ϵ)O\big{(}\kappa_{A}D^{*}\sqrt{L_{f}/\epsilon}\big{)} term makes their algorithm suboptimal for our problem class. Second, for linear equality constrained problems, their results require the constraint matrix AA to be full rank, while our result does not require this property. In our paper, we provide a tight lower complexity bound for this problem class and an optimal first-order algorithm that achieves the lower bound.

The second closely related work is salim2022optimal , which derives an O\left(\kappa_{A}\sqrt{\kappa_{f}}\log(1/\epsilon)\right) complexity for the smooth strongly convex linear equality constrained problem (2). Letting P(\cdot) be a properly selected Chebyshev polynomial, they propose an accelerated algorithm with Chebyshev iteration based on the equivalence between Ax=b and \sqrt{P(A^{\top}A)}x=\sqrt{P(A^{\top}A)}x^{*}. However, this technique is restricted to linear equality constrained problems, as the aforementioned equivalence fails for inequality constraints \big{(}h(\cdot)=\mathbb{I}_{\mathbb{R}^{n}_{+}}(\cdot)\big{)} and more general h. Moreover, their algorithm requires the smoothness of the objective function over the whole space, and they do not allow any constraints other than the linear equality ones. Hence their method cannot handle general non-smooth h(\cdot)\neq\mathbb{I}_{\{0\}}(\cdot) by introducing y=Ax-b and solving an equality constrained problem, because the new h(y) term violates both the global smoothness and the joint strong convexity assumptions in their paper. It is worth mentioning that the \Omega\left(\kappa_{A}\sqrt{\kappa_{f}}\log(1/\epsilon)\right) lower bound in salim2022optimal is indeed a valid lower bound for our problem. However, since the construction of our lower bound for the general convex case requires the lower bound for the strongly convex case as an intermediate step, we provide a self-contained lower bound for the strongly convex case in this paper. The hard instance and the structure of the zero chain in our example are both different from those in salim2022optimal .

The last related work that we would like to discuss is sun2019distributed , where an optimal first-order algorithm for non-convex distributed optimization is proposed. Like salim2022optimal , their results require the global smoothness of the objective function and rely on Chebyshev acceleration, and are hence hard to generalize to h(\cdot)\neq\mathbb{I}_{\{0\}}(\cdot). A lower bound for non-convex distributed optimization is also proposed in sun2019distributed . However, it is not clear how their lower bound can be reduced to our case, and both their lower and upper bounds include an unusual dependence on the squared norm of the initial gradient. Therefore, instead of trying to reduce from their results, we construct a new hard instance that lower bounds the complexity of making the proximal gradient mapping small.

Organization.  Section 2 reviews basic facts and defines the quantities used to measure the suboptimality of an approximate solution. In Section 3, we discuss the upper bounds for each of the three cases. For the sake of readability, we present an efficient algorithm for the strongly convex case first. Then, we utilize this algorithm to solve the proximal point subproblems of the non-convex case. We also utilize this algorithm to solve the general convex case by adding an O(\epsilon)-strongly convex perturbation and obtain a corresponding upper bound. Thus, we follow the order from strongly convex to non-convex to convex in the discussion. Then in Section 4, we present formal definitions of the problem classes and algorithm classes and provide the corresponding lower bounds. Finally, in Section 5, we conduct a detailed analysis of the linear equality constrained problem. In particular, we relax the full rank assumption on the matrix A. A direct acceleration for the general convex case, instead of adding strongly convex perturbations, is also presented in the last section.

Notations. We denote the Euclidean inner product by ,\langle\cdot,\cdot\rangle and the Euclidean norm by \|\cdot\|. Let f(,):m×nf(\cdot,\cdot):\mathbb{R}^{m}\times\mathbb{R}^{n}\mapsto\mathbb{R} be a differentiable function of two variables. To denote the partial gradient of ff with respect to the first (or second) variable at the point (x,y)(x,y), we use xf(x,y)\nabla_{x}f(x,y) (or yf(x,y)\nabla_{y}f(x,y)). The full gradient at (x,y)(x,y) is denoted as f(x,y)\nabla f(x,y), where f(x,y)=(xf(x,y),yf(x,y))\nabla f(x,y)=(\nabla_{x}f(x,y),\nabla_{y}f(x,y)). Suppose \mathcal{M} is a closed convex set, we use 𝒫()\mathcal{P}_{\mathcal{M}}(\cdot) to represent the projection operator onto \mathcal{M}. For any matrix Am×nA\in\mathbb{R}^{m\times n}, we use (A)\mathcal{R}(A) to denote its range space. For any vector xnx\in\mathbb{R}^{n}, we use supp{x}\operatorname{supp}\left\{x\right\} to denote the support of xx, i.e., supp{x}:={i[n]xi0}\operatorname{supp}\left\{x\right\}\mathrel{\mathop{:}}=\left\{i\in[n]\mid x_{i}\neq 0\right\}, where [n]={1,,n}[n]=\{1,\cdots,n\}. The indicator function for a set 𝒮\mathcal{S} is defined as 𝕀𝒮(x)=0\mathbb{I}_{\mathcal{S}}(x)=0 if x𝒮x\in\mathcal{S} and 𝕀𝒮(x)=+\mathbb{I}_{\mathcal{S}}(x)=+\infty if x𝒮x\notin\mathcal{S}. The identity matrix in d×d\mathbb{R}^{d\times d} is denoted as IdI_{d}. The positive part of a real number is represented as []+[\cdot]_{+}, defined as [x]+=max{x,0}[x]_{+}=\max\{x,0\}.

2 Preliminaries

2.1 Basic facts and definitions

Lipschitz Smoothness. For a differentiable function f(x)f(x), we say it is LfL_{f}-smooth if

\|\nabla f(x)-\nabla f(x^{\prime})\|\leq L_{f}\|x-x^{\prime}\|,\quad\forall x,x^{\prime}.

Strong Convexity. For a positive constant μf>0\mu_{f}>0, we say f(x)f(x) is μf\mu_{f}-strongly convex if f(x)μf2x2f(x)-\frac{\mu_{f}}{2}\|x\|^{2} is convex, and it is μf\mu_{f}-strongly concave if f(x)-f(x) is strongly convex.

Conjugate Function. The conjugate function of a function f()f(\cdot) is defined as

f(y)=supx{xTyf(x)}.f^{\star}(y)=\sup_{x}\{x^{\mathrm{T}}y-f(x)\}.

It is well-known that ff^{\star} is convex. Furthermore, when ff is assumed to be strongly convex or smooth, its conjugate function ff^{\star} has the following properties hiriart1996convex .

Lemma 1

If ff is μf\mu_{f}-strongly convex, then its conjugate ff^{\star} is 1μf\frac{1}{\mu_{f}}-smooth. If ff is LfL_{f}-smooth, then its conjugate ff^{\star} is 1Lf\frac{1}{L_{f}}-strongly convex.

Proximal Operator. For a proper closed convex function hh, the corresponding proximal operator is defined as

𝐩𝐫𝐨𝐱h(x):=argminu{h(u)+12xu2}.\mathbf{prox}_{h}(x)\mathrel{\mathop{:}}=\operatorname*{arg\,min}_{u}\left\{h(u)+\frac{1}{2}\|x-u\|^{2}\right\}.

2.2 Suboptimality measure

To study the upper and lower complexity bounds of first-order methods for problem (1), it is essential to define the appropriate suboptimality measures for the strongly convex, convex, and non-convex cases, respectively. In the following sections, we will present these measures one by one.

Strongly convex case.   In this case, the optimal solution is unique. We can define the suboptimality measure for a point x as its squared distance to the optimal solution

SubOpt𝖲𝖢(x):=xx2.\operatorname{SubOpt}_{\mathsf{SC}}(x)\mathrel{\mathop{:}}=\|x-x^{\star}\|^{2}. (5)

Non-convex case.   In the non-convex case, we define the suboptimality measure as

SubOpt𝖭𝖢(x):=Lfx𝐩𝐫𝐨𝐱1Lfh(Ab)(x12Lff(x)),\operatorname{SubOpt}_{\mathsf{NC}}(x)\mathrel{\mathop{:}}=L_{f}\left\|x-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x-\frac{1}{2L_{f}}\nabla f(x)\right)\right\|,

which measures the violation of the first-order optimality condition. It is worth pointing out that the operator 𝐩𝐫𝐨𝐱1Lfh(Ab)\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)} is not needed in the algorithm. We only utilize it as a formal measure but do not need to directly evaluate it in our algorithms.

Convex case.   The convex case is a little bit complicated. The standard suboptimality measure F(x)infxF(x)F(x)-\inf_{x^{\prime}}F(x^{\prime}) may not fit our setting, since h(Axb)=+h(Ax-b)=+\infty could happen when AxbdomhAx-b\not\in\mathrm{dom}h. For example, for linearly constrained problems where h(Axb)=𝕀{Ax=b}(x)h(Ax-b)=\mathbb{I}_{\{Ax=b\}}(x), typically first-order methods only guarantee returning solutions with small constraint violation instead of exact constraint satisfaction. Thus the objective function gap will always be ++\infty. To handle this issue, we propose the following suboptimality measure

SubOpt𝖢(x):=f(x)+hρ(Axb)minx{f(x)+h(Axb)},\operatorname{SubOpt}_{\mathsf{C}}(x)\mathrel{\mathop{:}}=f(x)+h_{\rho}(Ax-b)-\min_{x^{\prime}}\left\{f(x^{\prime})+h(Ax^{\prime}-b)\right\},

where h()h(\cdot) is replaced by the surrogate function hρ()h_{\rho}(\cdot) given by

hρ(z):=supy2ρ{zTyh(y)}.h_{\rho}(z)\mathrel{\mathop{:}}=\sup_{\|y\|_{2}\leq\rho}\{z^{\mathrm{T}}y-h^{\star}(y)\}. (6)

For any \rho\in(0,+\infty), h_{\rho}(x) can be viewed as a Lipschitz approximation of h(x), as the following lemma indicates.

Lemma 2

For any ρ(0,+)\rho\in(0,+\infty), the followings hold

  1. 1.

    hρ(x)h(x)h_{\rho}(x)\leq h(x), and hρ(x)h(x)h_{\rho}(x)\to h(x) as ρ+\rho\to+\infty.

  2. 2.

    hρ(x)h_{\rho}(x) is ρ\rho-Lipschitz continuous, and if h(x)h(x) is ρ\rho-Lipschitz continuous, then hρ(x)h(x)h_{\rho}(x)\equiv h(x).

  3. 3.

    If h(x)=𝕀𝒦(x)h(x)=\mathbb{I}_{\mathcal{K}}(x) where 𝒦\mathcal{K} is a proper cone, we have hρ(x)=ρ𝒫𝒦(x)h_{\rho}(x)=\rho\|\mathcal{P}_{\mathcal{K}^{\circ}}(x)\|.

Proof

Part 1. h_{\rho}(x)=\sup_{\|y\|\leq\rho}\{x^{\mathrm{T}}y-h^{\star}(y)\}\leq\sup_{y}\{x^{\mathrm{T}}y-h^{\star}(y)\}=h^{\star\star}(x)=h(x), where the last equality uses the fact that h is proper, closed and convex. Moreover, h_{\rho}(x) is nondecreasing in \rho and \sup_{\rho>0}h_{\rho}(x)=h^{\star\star}(x)=h(x), so h_{\rho}(x)\to h(x) as \rho\to+\infty.

Part 2. For any x,xx,x^{\prime}, we have

hρ(x)\displaystyle h_{\rho}(x) =supyρ{yT(xx)+yTxh(y)}\displaystyle=\sup_{\|y\|\leq\rho}\{y^{\mathrm{T}}(x-x^{\prime})+y^{\mathrm{T}}x^{\prime}-h^{\star}(y)\}
supyρyT(xx)+supyρ{yTxh(y)}\displaystyle\leq\sup_{\|y\|\leq\rho}y^{\mathrm{T}}(x-x^{\prime})+\sup_{\|y\|\leq\rho}\{y^{\mathrm{T}}x^{\prime}-h^{\star}(y)\}
ρxx+hρ(x),\displaystyle\leq\rho\|x-x^{\prime}\|+h_{\rho}(x^{\prime}),

which implies that hρ(x)h_{\rho}(x) is ρ\rho-Lipschitz continuous.

For any xx, let y(x)=argmaxy{xTyh(y)}y^{\star}(x)=\operatorname*{arg\,max}_{y}\{x^{\mathrm{T}}y-h^{\star}(y)\}, then we have y(x)h(x)y^{\star}(x)\in\partial h(x). It holds y(x)ρ\|y^{\star}(x)\|\leq\rho if h(x)h(x) is ρ\rho-Lipschitz continuous. Hence, we have hρ(x)h(x)h_{\rho}(x)\equiv h(x).

Part 3. When h(x)=𝕀𝒦(x)h(x)=\mathbb{I}_{\mathcal{K}}(x), the conjugate function is h(y)=𝕀𝒦(y)h^{\star}(y)=\mathbb{I}_{\mathcal{K}^{\circ}}(y) where 𝒦\mathcal{K}^{\circ} is the polar cone of 𝒦\mathcal{K}. Therefore, we have

hρ(x)=supyρ,y𝒦xTy=xT(ρ𝒫𝒦(x)𝒫𝒦(x))=ρ𝒫𝒦(x).h_{\rho}(x)=\sup\limits_{\|y\|\leq\rho,y\in\mathcal{K}^{\circ}}x^{\mathrm{T}}y=x^{\mathrm{T}}\left(\frac{\rho}{\|\mathcal{P}_{\mathcal{K}^{\circ}}(x)\|}\mathcal{P}_{\mathcal{K}^{\circ}}(x)\right)=\rho\|\mathcal{P}_{\mathcal{K}^{\circ}}(x)\|.

Remark 1

For norm regularized problems (4) where h(x)=xh(x)=\|x\|_{*} is an arbitrary norm, let \left\|\cdot\right\|^{*} be its dual norm, then the conjugate function h(y)=𝕀y1(y)h^{\star}(y)=\mathbb{I}_{\left\|y\right\|^{*}\leq 1}(y). As long as ρxx,x\rho\left\|x\right\|^{*}\geq\left\|x\right\|,\forall x, it holds that

hρ(x)=supyρ,y1xTy=supy1xTy=x=h(x).h_{\rho}(x)=\sup\limits_{\|y\|\leq\rho,\left\|y\right\|^{*}\leq 1}x^{\mathrm{T}}y=\sup\limits_{\left\|y\right\|^{*}\leq 1}x^{\mathrm{T}}y=\left\|x\right\|_{*}=h(x).

In particular, when h()=h(\cdot)=\|\cdot\|, it is sufficient to take ρ1\rho\geq 1.

Remark 2

For linear inequality constrained problems with h(x)=\mathbb{I}_{\{x\leq 0\}}, Lemma 2 indicates that h_{\rho}(Ax-b)=\rho\left\|[Ax-b]_{+}\right\|. Therefore, we have

SubOpt𝖢(x)=f(x)f(x)+ρ[Axb]+,\displaystyle\operatorname{SubOpt}_{\mathsf{C}}(x)=f(x)-f(x^{\star})+\rho\left\|[Ax-b]_{+}\right\|,

Let xx^{\star} and λ\lambda^{\star} be the optimal primal and dual solutions respectively, it is a standard lemma that (see e.g. (zhu2022unified, , Lemma 3)) when ρ2λ\rho\geq 2\left\|\lambda^{\star}\right\|, we have

max{|f(x)f(x)|,ρ[Axb]+}2SubOpt𝖢(x).\displaystyle\max\left\{\left|f(x)-f(x^{\star})\right|,\rho\left\|[Ax-b]_{+}\right\|\right\}\leq 2\operatorname{SubOpt}_{\mathsf{C}}(x).

Similarly, for linear equality constrained problems with h(x)=\mathbb{I}_{\{0\}}(x), we have h_{\rho}(Ax-b)=\rho\|Ax-b\|. Then \max\left\{\left|f(x)-f(x^{\star})\right|,\rho\left\|Ax-b\right\|\right\}\leq 2\operatorname{SubOpt}_{\mathsf{C}}(x), provided that \rho\geq\|\lambda^{\star}\|. Therefore, \operatorname{SubOpt}_{\mathsf{C}}(x) essentially agrees with the widely used suboptimality measure for convex linearly constrained problems, see e.g. xu2021first ; zhu2022unified ; hamedani2021primal . Furthermore, we can also cover the problem formulation with both equality and inequality constraints studied in zhang2022global ; zhang2020proximal by letting h(\cdot)=\mathbb{I}_{\{0\}\times\{x\leq 0\}}(\cdot).
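
To make the surrogate concrete, here is a tiny numerical check of the equality constrained case h=\mathbb{I}_{\{0\}} (so h^{\star}\equiv 0 and h_{\rho}(z)=\rho\|z\| by Lemma 2); the script below is only an illustration and all names in it are ours.

import numpy as np

rho = 2.0
z = np.array([3.0, -4.0])
# definition (6): sup over ||y|| <= rho of z^T y - h*(y), with h* = 0,
# is attained at y = rho * z / ||z||, giving rho * ||z||.
y_star = rho * z / np.linalg.norm(z)
print(z @ y_star, rho * np.linalg.norm(z))   # both equal 10.0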

2.3 Nesterov’s accelerated gradient descent

In this paper, we will frequently use Nesterov’s accelerated gradient descent (AGD) as a subroutine, as stated in Algorithm 1. It is the optimal first-order algorithm for the smooth strongly convex optimization problems nesterov2018lectures .

1 Input: objective function Ψ\Psi, Lipschitz constant LL, strong convexity parameter μ\mu, initial point y0y_{0} and tolerance δ>0\delta>0.
2 Initialize: κLμ,θκ1κ+1,y1y0,k0\kappa\leftarrow\frac{L}{\mu},\theta\leftarrow\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1},y_{-1}\leftarrow y_{0},k\leftarrow 0.
3 repeat
4       y~k+1yk+θ(ykyk1)\tilde{y}_{k+1}\leftarrow y_{k}+\theta(y_{k}-y_{k-1}).
5       yk+1y~k+11LΨ(y~k+1)y_{k+1}\leftarrow\tilde{y}_{k+1}-\frac{1}{L}\nabla\Psi(\tilde{y}_{k+1}).
6       kk+1k\leftarrow k+1.
7until Ψ(yT)μδ\left\|\nabla\Psi(y_{T})\right\|\leq\mu\delta;
Output: yTy_{T}.
Algorithm 1 𝖠𝖦𝖣(Ψ,y0,δ)\mathsf{AGD}(\Psi,y_{0},\delta)

The following proposition is a simple corollary of (nesterov2018lectures, , Theorem 2.2.2).

Proposition 1

Assume that Ψ\Psi is LL-smooth and μ\mu-strongly convex. Denote κ=Lμ\kappa=\frac{L}{\mu} and y=argminyΨ(y)y^{\star}=\operatorname*{arg\,min}_{y}\Psi(y). Algorithm 1 𝖠𝖦𝖣(Ψ,y0,δ)\mathsf{AGD}(\Psi,y_{0},\delta) terminates in

T2κlog(2κy0yδ)\displaystyle T\leq 2\sqrt{\kappa}\log\left(\frac{2\kappa\left\|y_{0}-y^{\star}\right\|}{\delta}\right)

iterations, and the output yTy_{T} satisfies yTyδ\left\|y_{T}-y^{\star}\right\|\leq\delta.
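
As a reference point, the following is a minimal NumPy sketch of Algorithm 1; grad_Psi is an assumed gradient oracle for \Psi, and the function name, the max_iter safeguard and the stopping test are illustrative rather than part of the paper.

import numpy as np

def agd(grad_Psi, L, mu, y0, delta, max_iter=100000):
    # Nesterov's AGD (Algorithm 1). Stopping once ||grad Psi(y)|| <= mu * delta
    # guarantees ||y - y*|| <= delta by mu-strong convexity.
    kappa = L / mu
    theta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    y_prev, y = y0.copy(), y0.copy()
    for _ in range(max_iter):
        y_tilde = y + theta * (y - y_prev)                 # extrapolation
        y_prev, y = y, y_tilde - grad_Psi(y_tilde) / L     # gradient step at y_tilde
        if np.linalg.norm(grad_Psi(y)) <= mu * delta:
            break
    return y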

3 Upper bounds

3.1 Strongly convex case: Intuition

Though 𝐩𝐫𝐨𝐱h(x)\mathbf{prox}_{h}(x) is easy to evaluate, h(Axb)h(Ax-b) may not have efficient proximal mapping. Therefore, by utilizing the conjugate function of hh, we decouple the composite structure by reformulating (1) as a convex-concave saddle point problem

minxmaxλ(x,λ):=f(x)+λT(Axb)h(λ).\min_{x}\max_{\lambda}\ \mathcal{L}(x,\lambda)\mathrel{\mathop{:}}=f(x)+\lambda^{\mathrm{T}}(Ax-b)-h^{\star}(\lambda). (7)

Switching the order of min\min and max\max, and minimizing with respect to xx, we obtain the dual problem of (1):

maxλΦ(λ):=f(ATλ)bTλh(λ).\max_{\lambda}\ \Phi(\lambda)\mathrel{\mathop{:}}=-f^{\star}(-A^{\mathrm{T}}\lambda)-b^{\mathrm{T}}\lambda-h^{\star}(\lambda). (8)

The following lemma illustrates that the dual problem is strongly concave.

Lemma 3

Φ(λ)\Phi(\lambda) is μΦ\mu_{\Phi}-strongly concave with μΦ:=μA2/Lf\mu_{\Phi}\mathrel{\mathop{:}}=\mu_{A}^{2}/L_{f}.

Proof

Note that f(x)f(x) is LfL_{f}-smooth and μf\mu_{f}-strongly convex, Lemma 1 indicates that f()f^{\star}(\cdot) is 1μf\frac{1}{\mu_{f}}-smooth and 1Lf\frac{1}{L_{f}}-strongly convex. Denote f~(λ):=f(ATλ)\tilde{f}(\lambda)\mathrel{\mathop{:}}=f^{\star}(-A^{\mathrm{T}}\lambda), then for any λ,λm\lambda,\lambda^{\prime}\in\mathbb{R}^{m}, we have

f~(λ)f~(λ)\displaystyle\tilde{f}(\lambda^{\prime})-\tilde{f}(\lambda) f(ATλ),ATλ+ATλ+12LfATλATλ2\displaystyle\geq\langle\nabla f^{\star}(-A^{\mathrm{T}}\lambda),-A^{\mathrm{T}}\lambda^{\prime}+A^{\mathrm{T}}\lambda\rangle+\frac{1}{2L_{f}}\|A^{\mathrm{T}}\lambda-A^{\mathrm{T}}\lambda^{\prime}\|^{2}
=f~(λ),λλ+12LfATλATλ2\displaystyle=\langle\nabla\tilde{f}(\lambda),\lambda^{\prime}-\lambda\rangle+\frac{1}{2L_{f}}\|A^{\mathrm{T}}\lambda-A^{\mathrm{T}}\lambda^{\prime}\|^{2}
f~(λ),λλ+μA22Lfλλ2,\displaystyle\geq\langle\nabla\tilde{f}(\lambda),\lambda^{\prime}-\lambda\rangle+\frac{\mu_{A}^{2}}{2L_{f}}\|\lambda-\lambda^{\prime}\|^{2},

which implies that f~(λ)\tilde{f}(\lambda) is (μA2/Lf)(\mu_{A}^{2}/L_{f})-strongly convex. Combining the fact that bTλ+h(λ)b^{\mathrm{T}}\lambda+h^{\star}(\lambda) is convex, we complete the proof. ∎

One can observe that the Lipschitz continuity of f(x)\nabla f(x) is transferred to the strong concavity of Φ(λ)\Phi(\lambda) through the matrix AA. Therefore, a linear convergence can be expected. To exploit this observation, we perform an inexact proximal point algorithm to solve the dual problem (8):

λkargmaxλΦk(λ):=Φ(λ)2λλk12,\lambda_{k}\approx\operatorname*{arg\,max}_{\lambda}\ \Phi_{k}(\lambda)\mathrel{\mathop{:}}=\Phi(\lambda)-\frac{\ell}{2}\|\lambda-\lambda_{k-1}\|^{2}, (9)

where λk1\lambda_{k-1} is the iterate of the (k1)(k-1)-th step. Now it remains to solve the subproblem (9) efficiently. By expanding the term f(ATλ)f^{\star}(-A^{\mathrm{T}}\lambda) through the conjugate function again, we can rewrite (9) into the equivalent saddle point problem

maxλminxk(x,λ):=f(x)+λT(Axb)h(λ)2λλk12.\max_{\lambda}\min_{x}\ {\mathcal{L}}_{k}(x,\lambda)\mathrel{\mathop{:}}=f(x)+\lambda^{\mathrm{T}}(Ax-b)-h^{\star}(\lambda)-\frac{\ell}{2}\|\lambda-\lambda_{k-1}\|^{2}. (10)

Let gk(λ):=h(λ)+2λλk12g_{k}(\lambda)\mathrel{\mathop{:}}=h^{\star}(\lambda)+\frac{\ell}{2}\|\lambda-\lambda_{k-1}\|^{2}. Then the dual problem of (9) is given by

minxΨk(x):=f(x)+gk(Axb).\min_{x}\ \Psi_{k}(x)\mathrel{\mathop{:}}=f(x)+g_{k}^{\star}(Ax-b). (11)

The function Ψk()\Psi_{k}(\cdot) is LΨL_{\Psi}-smooth and μf\mu_{f}-strongly convex, with LΨ=Lf+LA2L_{\Psi}=L_{f}+\frac{L_{A}^{2}}{\ell}. This time, the strong convexity induced by the proximal term for the dual variable λ\lambda is transferred to the Lipschitz smoothness in the primal variable xx. With a few more computations, we show that Ψk\nabla\Psi_{k} can be easily evaluated, and hence AGD can be applied to solve (11).

Proposition 2

The gradient Ψk()\nabla\Psi_{k}(\cdot) can be evaluated with one call of AA, one call of ATA^{\mathrm{T}}, one call of 𝐩𝐫𝐨𝐱h()\mathbf{prox}_{\ell h}(\cdot), and one call of f()\nabla f(\cdot), respectively.

Proof

With λk(x):=argmaxλk(x,λ)\lambda_{k}^{\star}(x)\mathrel{\mathop{:}}=\operatorname*{arg\,max}_{\lambda}\mathcal{L}_{k}(x,\lambda), Danskin’s theorem indicates that

Ψk(x)=xk(x,λk(x))=f(x)+ATλk(x).\nabla\Psi_{k}(x)=\nabla_{x}\mathcal{L}_{k}(x,\lambda_{k}^{\star}(x))=\nabla f(x)+A^{\mathrm{T}}\lambda_{k}^{\star}(x).

In fact, λk(x)\lambda_{k}^{\star}(x) can be explicitly written as

\begin{split}\lambda_{k}^{\star}(x)&=\operatorname*{arg\,max}_{\lambda}\ \left\{-h^{\star}(\lambda)-{\frac{\ell}{2}}\left\|\lambda-\lambda_{k-1}-\frac{Ax-b}{\ell}\right\|^{2}\right\}\\ &=\mathbf{prox}_{\frac{h^{\star}}{\ell}}\left(\lambda_{k-1}+\frac{Ax-b}{\ell}\right)\\ &=\lambda_{k-1}+\frac{Ax-b}{\ell}-\frac{1}{\ell}\mathbf{prox}_{{\ell}h}\left({\ell}\lambda_{k-1}+Ax-b\right),\end{split} (12)

where the last equality comes from Moreau’s decomposition theorem (showalter1997monotone, , Proposition IV.1.8). ∎

Therefore, Ψk(x)\nabla\Psi_{k}(x) can be easily evaluated and we can apply Algorithm 1 to efficiently obtain xkargminxΨk(x)x_{k}\approx\operatorname*{arg\,min}_{x}\Psi_{k}(x) and then update λk=λk(xk)\lambda_{k}=\lambda_{k}^{\star}(x_{k}) by (12), which will be proved to be an approximate solution of the subproblem (9).
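
A minimal sketch of this gradient evaluation, assuming a user-supplied proximal operator prox_lh for \ell h and a gradient oracle grad_f; all function names are illustrative.

import numpy as np

def lambda_star(x, lam_prev, ell, A, b, prox_lh):
    # lambda_k^*(x) = prox_{h^*/ell}(lam_prev + (Ax - b)/ell), computed from
    # prox_{ell h} via Moreau's decomposition as in (12).
    u = ell * lam_prev + A @ x - b
    return (u - prox_lh(u)) / ell

def grad_Psi_k(x, lam_prev, ell, A, b, grad_f, prox_lh):
    # grad Psi_k(x) = grad f(x) + A^T lambda_k^*(x): one call each of A, A^T,
    # prox_{ell h} and grad f, as stated in Proposition 2.
    return grad_f(x) + A.T @ lambda_star(x, lam_prev, ell, A, b, prox_lh)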

3.2 Strongly convex case: Analysis

Now, we are ready to give the complete algorithm for strongly convex problems.

1 Input: initial point x0,λ0x_{0},\lambda_{0}, smoothness parameter LfL_{f}, strong convexity parameter μf\mu_{f}, minimal singular value μA\mu_{A}, regularization parameter μΦ\ell\geq\mu_{\Phi}, radius parameter DD.
2 Initialize: ρ=μΦ12,δk=(1ρ)k2D\rho=\frac{\mu_{\Phi}}{12\ell},\delta_{k}=(1-\rho)^{\frac{k}{2}}D.
3 for k=1,,Tk=1,\dots,T do
4       Update xk𝖠𝖦𝖣(Ψk,xk1,δk)x_{k}\leftarrow\mathsf{AGD}(\Psi_{k},x_{k-1},\delta_{k}).
5      Update λk\lambda_{k} by (12): λk𝐩𝐫𝐨𝐱h(λk1+Axkb).\lambda_{k}\leftarrow\mathbf{prox}_{\frac{h^{\star}}{\ell}}\left(\lambda_{k-1}+\frac{Ax_{k}-b}{\ell}\right).
Output: xTx_{T}.
Algorithm 2 Inexact PPA for strongly convex problems
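
Below is a compact sketch of Algorithm 2, built on the agd, lambda_star and grad_Psi_k sketches above, with the choice \ell=\mu_{\Phi} (so \rho=1/12); the solver interface and all names are ours, not the paper's.

import numpy as np

def inexact_ppa_sc(x0, lam0, grad_f, prox_lh, A, b, Lf, mu_f, mu_A, D, T):
    # Algorithm 2: inexact PPA on the dual; each subproblem min_x Psi_k(x)
    # is solved by AGD, followed by the dual update (12).
    ell = mu_A**2 / Lf                                  # ell = mu_Phi
    rho = 1.0 / 12.0                                    # rho = mu_Phi / (12 ell)
    L_Psi = Lf + np.linalg.norm(A, 2)**2 / ell          # smoothness of Psi_k
    x, lam = x0.copy(), lam0.copy()
    for k in range(1, T + 1):
        delta_k = (1.0 - rho)**(k / 2.0) * D
        g = lambda z, lam=lam: grad_Psi_k(z, lam, ell, A, b, grad_f, prox_lh)
        x = agd(g, L_Psi, mu_f, x, delta_k)             # x_k ~ argmin Psi_k
        lam = lambda_star(x, lam, ell, A, b, prox_lh)   # lambda_k = lambda_k^*(x_k)
    return x
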
Proposition 3

Suppose that f(x) is L_{f}-smooth and \mu_{f}-strongly convex, the matrix A satisfies \left\|A\right\|\leq L_{A}, and the minimum singular value of A is no smaller than \mu_{A}. Let (x^{\star},\lambda^{\star}) be the pair of optimal primal and dual variables, and let \{(x_{k},\lambda_{k})\} be the iterate sequence generated by Algorithm 2. Assume that \ell\geq\mu_{\Phi}. Then for any 0\leq k\leq T, it holds that

\left\|\lambda_{k}-\lambda^{\star}\right\|\leq(1-\rho)^{k/2}\cdot M, (13)

where M=max{λ0λ,10LAμΦ1D}M=\max\left\{\left\|\lambda_{0}-\lambda^{\star}\right\|,10L_{A}\mu_{\Phi}^{-1}D\right\}. If we denote xk=argminxΨk(x)x_{k}^{\star}=\operatorname*{arg\,min}_{x}\Psi_{k}(x), then for any 2kT2\leq k\leq T, it holds that

xk1xk\displaystyle\left\|x_{k-1}-x^{\star}_{k}\right\|\leq LAμfλk1λk2+δk1,\displaystyle~{}\frac{L_{A}}{\mu_{f}}\left\|\lambda_{k-1}-\lambda_{k-2}\right\|+\delta_{k-1}, (14)
xkx\displaystyle\left\|x_{k}-x^{\star}\right\|\leq LAμfλk1λ+δk.\displaystyle~{}\frac{L_{A}}{\mu_{f}}\left\|\lambda_{k-1}-\lambda^{\star}\right\|+\delta_{k}. (15)
Proof

Let λk=argmaxλΦk(λ)\lambda^{\star}_{k}=\operatorname*{arg\,max}_{\lambda}\Phi_{k}(\lambda) and xk=argminxΨk(x)x_{k}^{\star}=\operatorname*{arg\,min}_{x}\Psi_{k}(x). Recall that the output xkx_{k} of Algorithm 𝖠𝖦𝖣(Ψk,xk1,δk)\mathsf{AGD}(\Psi_{k},x_{k-1},\delta_{k}) guarantees that xkxkδk\left\|x_{k}-x^{\star}_{k}\right\|\leq\delta_{k}. According to the dual update rule (12) and the optimality of (xk,λk)(x^{\star}_{k},\lambda^{\star}_{k}), we have

λk=𝐩𝐫𝐨𝐱h(λk1+Axkb),λk=𝐩𝐫𝐨𝐱h(λk1+Axkb).\lambda_{k}^{\star}=\mathbf{prox}_{\frac{h^{\star}}{\ell}}\left(\lambda_{k-1}+\frac{Ax_{k}^{\star}-b}{\ell}\right),\quad\lambda_{k}=\mathbf{prox}_{\frac{h^{\star}}{\ell}}\left(\lambda_{k-1}+\frac{Ax_{k}-b}{\ell}\right).

Thus, we can deduce λkλkAxkAxkLAδk\left\|\lambda_{k}-\lambda^{\star}_{k}\right\|\leq\left\|\frac{Ax_{k}-Ax^{\star}_{k}}{\ell}\right\|\leq\frac{L_{A}\delta_{k}}{\ell} due to the non-expansiveness of proximal operator. By the definition of λk\lambda^{\star}_{k}, it holds that

0Φk(λk)=Φ(λk)+(λkλk1),\displaystyle 0\in\partial\Phi_{k}(\lambda^{\star}_{k})=\partial\Phi(\lambda^{\star}_{k})+\ell(\lambda^{\star}_{k}-\lambda_{k-1}),

which implies that there exists ΦkΦ(λk)\Phi^{\prime}_{k}\in\partial\Phi(\lambda^{\star}_{k}) satisfying Φk+(λkλk1)=0\Phi^{\prime}_{k}+\ell(\lambda^{\star}_{k}-\lambda_{k-1})=0. Combining the fact that Φ\Phi is μΦ\mu_{\Phi}-strongly concave and λ=argmaxλΦ(λ)\lambda^{\star}=\operatorname*{arg\,max}_{\lambda}\Phi(\lambda), we have

μΦλkλ2\displaystyle\mu_{\Phi}\left\|\lambda^{\star}_{k}-\lambda^{\star}\right\|^{2}\leq Φk,λkλ=λk1λk,λkλ\displaystyle~{}\left\langle{\Phi^{\prime}_{k}},{\lambda^{\star}_{k}-\lambda^{\star}}\right\rangle=\ell\left\langle{\lambda_{k-1}-\lambda^{\star}_{k}},{\lambda^{\star}_{k}-\lambda^{\star}}\right\rangle
=\displaystyle= 2(λk1λ2λkλ2λkλk12).\displaystyle~{}\frac{\ell}{2}\left(\left\|\lambda_{k-1}-\lambda^{\star}\right\|^{2}-\left\|\lambda^{\star}_{k}-\lambda^{\star}\right\|^{2}-\left\|\lambda^{\star}_{k}-\lambda_{k-1}\right\|^{2}\right).

Hence, for any constant c>0c>0, we have

+2μΦλk1λ2λkλ2(1+c)1λkλ2c1λkλk2.\displaystyle\frac{\ell}{\ell+2\mu_{\Phi}}\left\|\lambda_{k-1}-\lambda^{\star}\right\|^{2}\geq\left\|\lambda^{\star}_{k}-\lambda^{\star}\right\|^{2}\geq(1+c)^{-1}\left\|\lambda_{k}-\lambda^{\star}\right\|^{2}-c^{-1}\left\|\lambda^{\star}_{k}-\lambda_{k}\right\|^{2}.

Picking c=μΦ6c=\frac{\mu_{\Phi}}{6\ell} and utilizing the assumption that μΦ\ell\geq\mu_{\Phi} yields that

λkλ2(1μΦ6)λk1λ2+7μΦ(C1δk)2,\displaystyle\left\|\lambda_{k}-\lambda^{\star}\right\|^{2}\leq\left(1-\frac{\mu_{\Phi}}{6\ell}\right)\left\|\lambda_{k-1}-\lambda^{\star}\right\|^{2}+\frac{7\ell}{\mu_{\Phi}}(C_{1}\delta_{k})^{2},

where we denote C1=LA/C_{1}=L_{A}/\ell. Therefore, by recursively applying the inequality above, we have

λkλ2(12ρ)kλ0λ2+712ρt=1k(12ρ)kt(C1δt)2.\left\|\lambda_{k}-\lambda^{\star}\right\|^{2}\leq(1-2\rho)^{k}\left\|\lambda_{0}-\lambda^{\star}\right\|^{2}+\frac{7}{12\rho}\sum_{t=1}^{k}(1-2\rho)^{k-t}(C_{1}\delta_{t})^{2}.

By our choice of δt=(1ρ)t/2D\delta_{t}=(1-\rho)^{t/2}D, it holds

t=1k(12ρ)kt(δt)2[(1ρ)k(12ρ)k]D2ρ.\displaystyle\sum_{t=1}^{k}(1-2\rho)^{k-t}(\delta_{t})^{2}\leq\left[(1-\rho)^{k}-(1-2\rho)^{k}\right]\cdot\frac{D^{2}}{\rho}.

Putting these pieces together and plugging in the definition of C1C_{1} and ρ\rho, we get (13).

Recall the definition of \Psi_{k} given in (11). We can rewrite it as

Ψk(x)=f(x)+h^(Axb+λk1)2λk12,\displaystyle\Psi_{k}(x)=f(x)+\hat{h}(Ax-b+\ell\lambda_{k-1})-\frac{\ell}{2}\|\lambda_{k-1}\|^{2},

where h^\hat{h} is defined by

h^(u)=maxλ(λTuh(λ)2λ2).\displaystyle\hat{h}(u)=\max_{\lambda}\left(\lambda^{\mathrm{T}}u-h^{\star}(\lambda)-\frac{\ell}{2}\left\|\lambda\right\|^{2}\right).

Note that h^(u)\hat{h}(u) is the conjugate function of h(λ)+2λ2h^{\star}(\lambda)+\frac{\ell}{2}\|\lambda\|^{2}, and hence it follows from Lemma 1 that h^(u)\hat{h}(u) is 1\frac{1}{\ell}-smooth. By definition, we know Ψk(xk)=Ψk1(xk1)\nabla\Psi_{k}(x^{\star}_{k})=\nabla\Psi_{k-1}(x^{\star}_{k-1}), and hence

μfxkxk1\displaystyle\mu_{f}\left\|x^{\star}_{k}-x^{\star}_{k-1}\right\|\leq Ψk(xk)Ψk(xk1)=Ψk1(xk1)Ψk(xk1)\displaystyle~{}\left\|\nabla\Psi_{k}(x^{\star}_{k})-\nabla\Psi_{k}(x^{\star}_{k-1})\right\|=\left\|\nabla\Psi_{k-1}(x^{\star}_{k-1})-\nabla\Psi_{k}(x^{\star}_{k-1})\right\|
=\displaystyle= ATh^(Axk1b+λk2)ATh^(Axk1b+λk1)\displaystyle~{}\left\|A^{\mathrm{T}}\nabla\hat{h}(Ax^{\star}_{k-1}-b+\ell\lambda_{k-2})\!-\!A^{\mathrm{T}}\nabla\hat{h}(Ax^{\star}_{k-1}-b+\ell\lambda_{k-1})\right\|
\displaystyle\leq A1λk2λk1=Aλk2λk1,\displaystyle~{}\left\|A\right\|\cdot\frac{1}{\ell}\cdot\ell\left\|\lambda_{k-2}-\lambda_{k-1}\right\|=\left\|A\right\|\left\|\lambda_{k-2}-\lambda_{k-1}\right\|,

where the first inequality follows from the μf\mu_{f}-strong-convexity of ff, and the last inequality holds because h^\hat{h} is 1\frac{1}{\ell}-smooth. Combining the fact that xk1xkδk+xk1xk\left\|x_{k-1}-x^{\star}_{k}\right\|\leq\delta_{k}+\left\|x^{\star}_{k-1}-x^{\star}_{k}\right\|, we get (14).

Similarly, by the optimality of (x,λ)(x^{\star},\lambda^{\star}), we have xx^{\star} is the minimum of

Ψ(x):=f(x)+h^(Axb+λ)2λ2.\displaystyle\Psi_{\star}(x)\mathrel{\mathop{:}}=f(x)+\hat{h}(Ax-b+\ell\lambda^{\star})-\frac{\ell}{2}\|\lambda^{\star}\|^{2}.

Repeating the argument above yields (15). ∎

Theorem 3.1

Under the same assumptions as in Proposition 3 and \left\|x_{0}-x^{\star}\right\|\leq D, the number of inner iterations T_{k} of \mathsf{AGD}(\Psi_{k},x_{k-1},\delta_{k}) in each outer iteration of Algorithm 2 can be upper bounded by

Tk8Lf+1LA2μflog(10κfκADD),\displaystyle T_{k}\leq 8\sqrt{\frac{L_{f}+\ell^{-1}L_{A}^{2}}{\mu_{f}}}\log\left(\frac{10\kappa_{f}\kappa_{A}D_{\star}}{D}\right),

where D:=x0x+LALfλ0λD_{\star}\mathrel{\mathop{:}}=\left\|x_{0}-x^{\star}\right\|+\frac{L_{A}}{L_{f}}\left\|\lambda_{0}-\lambda^{\star}\right\|. Furthermore, for Algorithm 2 to find an approximate solution xTx_{T} satisfying xTxϵ\left\|x_{T}-x^{\star}\right\|\leq\epsilon, the number of outer iterations is

T12μΦlog(100κfκADϵ).\displaystyle T\leq\frac{12\ell}{\mu_{\Phi}}\log\left(100\kappa_{f}\kappa_{A}\cdot\frac{D}{\epsilon}\right).

One comment here is that although we assume knowledge of an upper bound D on the distance from x_{0} to the optimal solution x^{*}, this bound only appears in the logarithmic terms. Hence one can use a very loose overestimate of the distance in the algorithm without deteriorating the performance.

Proof

We first consider the case k\geq 2. By combining (14) and (13), we have

xk1xkLAμf2(1ρ)k/21M+δk1(1ρ)k/2(2D+3LAμfM).\displaystyle\left\|x_{k-1}-x^{\star}_{k}\right\|\leq\frac{L_{A}}{\mu_{f}}\cdot 2(1-\rho)^{k/2-1}M+\delta_{k-1}\leq(1-\rho)^{k/2}\left(2D+\frac{3L_{A}}{\mu_{f}}M\right).

where the last inequality holds because 0<\rho\leq\frac{1}{12}. Therefore, by Proposition 1 and our choice \delta_{k}=(1-\rho)^{k/2}D, the number of inner steps T_{k} of \mathsf{AGD}(\Psi_{k},x_{k-1},\delta_{k}) can be upper bounded by

Tk\displaystyle T_{k}\leq 2LΨμΨlog(2LΨμΨxk1xkδk)\displaystyle~{}2\sqrt{\frac{L_{\Psi}}{\mu_{\Psi}}}\log\left(\frac{2L_{\Psi}}{\mu_{\Psi}}\cdot\frac{\left\|x_{k-1}-x^{\star}_{k}\right\|}{\delta_{k}}\right)
\displaystyle\leq 2LΨμΨlog(2LΨμΨ(2+3LAμfMD))\displaystyle~{}2\sqrt{\frac{L_{\Psi}}{\mu_{\Psi}}}\log\left(\frac{2L_{\Psi}}{\mu_{\Psi}}\cdot\left(2+\frac{3L_{A}}{\mu_{f}}\frac{M}{D}\right)\right)
\displaystyle\leq 8Lf+1LA2μflog(10κfκA(1+LAλ0λLfD)).\displaystyle~{}8\sqrt{\frac{L_{f}+\ell^{-1}L_{A}^{2}}{\mu_{f}}}\log\left(10\kappa_{f}\kappa_{A}\left(1+\frac{L_{A}\left\|\lambda_{0}-\lambda^{\star}\right\|}{L_{f}D}\right)\right).

where the last inequality follows from plugging in the definition of MM, LΨL_{\Psi} and μΨ=μf\mu_{\Psi}=\mu_{f}. The case k=1k=1 follows similarly.

By combining (15) and (13), we have

xTxLAμfλT1λ+δT(1ρ)T/2(D+3LAμfM).\displaystyle\left\|x_{T}-x^{\star}\right\|\leq\frac{L_{A}}{\mu_{f}}\left\|\lambda_{T-1}-\lambda^{\star}\right\|+\delta_{T}\leq(1-\rho)^{T/2}\left(D+\frac{3L_{A}}{\mu_{f}}M\right).

The desired result follows immediately. ∎

Corollary 1

Suppose that the same assumptions of Proposition 3 hold and DDD_{\star}\leq D. In order to find an approximate solution xTx_{T} satisfying SubOpt𝖲𝖢(xT)ϵ\operatorname{SubOpt}_{\mathsf{SC}}(x_{T})\leq\epsilon, the total number of gradient evaluations for Algorithm 2 is bounded by

O~(κAκflog(D/ϵ)).\tilde{O}\left(\kappa_{A}\sqrt{\kappa_{f}}\log(D/\epsilon)\right).
Proof

According to Theorem 3.1, if we let \ell=\mu_{\Phi}, then the complexity of each inner loop is \tilde{O}(\kappa_{A}\sqrt{\kappa_{f}}) and the complexity of the outer loop is \tilde{O}\left(\log(D/\epsilon)\right). Therefore, the overall complexity is \tilde{O}\left(\kappa_{A}\sqrt{\kappa_{f}}\log(D/\epsilon)\right).

3.3 Non-convex case

For non-convex problems, our algorithm is presented in Algorithm 3, which employs the inexact proximal point algorithm in the outer iterations while solving the strongly convex subproblems via Algorithm 2.

1 Input: initial point x0x_{0}, smoothness parameter LfL_{f}, condition number κA\kappa_{A}, subproblem error tolerance {δk}\{\delta_{k}\} and the maximum iteration number TT.
2 for k=1,,Tk=1,\cdots,T do
3       Apply Algorithm 2 to find
xkxk:=argminx{F(x)+Lfxxk12},x_{k}\approx x_{k}^{\star}:=\operatorname*{arg\,min}_{x}\left\{F(x)+L_{f}\|x-x_{k-1}\|^{2}\right\}, (16)
such that xkxkδk\|x_{k}-x_{k}^{\star}\|\leq\delta_{k}.
Output: {xk}k=1T\{x_{k}\}_{k=1}^{T}.
Algorithm 3 Inexact PPA for non-convex problems
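
A minimal sketch of the outer loop of Algorithm 3; solve_sc stands for an abstract solver for the strongly convex subproblem (16) in the spirit of Algorithm 2 (it is assumed to handle the h(Ax-b) part internally via the proximal oracle), only the gradient of the smooth part is passed here, and the interface is ours.

import numpy as np

def inexact_ppa_nc(x0, grad_f, Lf, Delta, T, solve_sc):
    # Algorithm 3: at step k, approximately solve (16), i.e. minimize
    # F(x) + Lf * ||x - x_{k-1}||^2, to accuracy delta_k = sqrt(Delta/Lf)/(2k).
    xs = [x0.copy()]
    for k in range(1, T + 1):
        delta_k = np.sqrt(Delta / Lf) / (2.0 * k)
        x_prev = xs[-1]
        # the smooth part f(x) + Lf||x - x_prev||^2 is 3Lf-smooth, 2Lf-strongly convex
        grad_k = lambda x, xc=x_prev: grad_f(x) + 2.0 * Lf * (x - xc)
        xs.append(solve_sc(grad_k, L=3.0 * Lf, mu=2.0 * Lf, x_init=x_prev, tol=delta_k))
    return xs[1:]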

Under a suitable sequence of {δk}\{\delta_{k}\}, we provide the convergence rate of Algorithm 3 in the following theorem.

Theorem 3.2

Suppose that f(x) is L_{f}-smooth and the condition number of A is \kappa_{A}. Assume that F(x_{1}^{\star})-\inf_{x}F(x)\leq\Delta^{\prime}, set \delta_{k}=\frac{\sqrt{\Delta^{\prime}/L_{f}}}{2k}, and let \{x_{k}\} be the iterate sequence generated by Algorithm 3. Then we have

\min_{1\leq k\leq T}\operatorname{SubOpt}_{\mathsf{NC}}(x_{k})\leq\sqrt{\frac{5L_{f}\Delta^{\prime}}{T}}.

In particular, suppose that the initial point x0domFx_{0}\in\mathrm{dom}F and F(x0)infxF(x)ΔF(x_{0})-\inf_{x}F(x)\leq\Delta for some Δ>0\Delta>0, then taking Δ=Δ\Delta^{\prime}=\Delta yields

\min_{1\leq k\leq T}\operatorname{SubOpt}_{\mathsf{NC}}(x_{k})\leq\sqrt{\frac{5L_{f}\Delta}{T}}. (17)
Proof

By the definition of xk+1x_{k+1}^{\star}, on the one hand, we have F(xk+1)+Lfxk+1xk2F(xk)+Lfxkxk2F(x_{k+1}^{\star})+L_{f}\|x_{k+1}^{\star}-x_{k}\|^{2}\leq F(x_{k}^{\star})+L_{f}\|x_{k}^{\star}-x_{k}\|^{2}, which yields

xk+1xk21Lf(F(xk)F(xk+1))+δk2.\|x_{k+1}^{\star}-x_{k}\|^{2}\leq\frac{1}{L_{f}}(F(x_{k}^{\star})-F(x_{k+1}^{\star}))+\delta_{k}^{2}. (18)

On the other hand, it holds

xk+1=𝐩𝐫𝐨𝐱1Lfh(Ab)(xk12Lff(xk+1)).x_{k+1}^{\star}=\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}-\frac{1}{2L_{f}}\nabla f(x_{k+1}^{\star})\right).

Therefore, we have

xk𝐩𝐫𝐨𝐱1Lfh(Ab)(xk12Lff(xk))\displaystyle\left\|x_{k}-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}-\frac{1}{2L_{f}}\nabla f(x_{k})\right)\right\|
\displaystyle\leq xkxk+1+xk+1𝐩𝐫𝐨𝐱1Lfh(Ab)(xk12Lff(xk))\displaystyle\ \|x_{k}-x_{k+1}^{\star}\|+\left\|x_{k+1}^{\star}-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}-\frac{1}{2L_{f}}\nabla f(x_{k})\right)\right\|
\displaystyle\leq 2xkxk+1,\displaystyle\ 2\|x_{k}-x_{k+1}^{\star}\|,

where the last inequality holds because the proximal operator is non-expansive and f()f(\cdot) is LfL_{f}-smooth. Combining with (18), we obtain

Lf2xk𝐩𝐫𝐨𝐱1Lfh(Ab)(xk12Lff(xk))2\displaystyle L_{f}^{2}\left\|x_{k}-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}-\frac{1}{2L_{f}}\nabla f(x_{k})\right)\right\|^{2}
4Lf2xkxk+124Lf(F(xk)F(xk+1))+4Lf2δk2.\displaystyle\leq 4L_{f}^{2}\|x_{k}-x_{k+1}^{\star}\|^{2}\leq 4L_{f}(F(x_{k}^{\star})-F(x_{k+1}^{\star}))+4L_{f}^{2}\delta_{k}^{2}.

Summing up the above inequality over k=1,\cdots,T and dividing by T,

1Tk=1TLf2xk𝐩𝐫𝐨𝐱1Lfh(Ab)(xkf(xk)2Lf)24Lf(F(x1)F(xT+1))+LfΔT.\displaystyle\frac{1}{T}\sum_{k=1}^{T}L_{f}^{2}\left\|x_{k}\!-\!\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}\!-\!\frac{\nabla f(x_{k})}{2L_{f}}\right)\right\|^{2}\!\leq\!\frac{4L_{f}(F(x_{1}^{\star})\!-\!F(x_{T+1}^{\star}))\!+\!L_{f}\Delta^{\prime}}{T}.

Since F(x1)F(xT+1)F(x1)infxF(x)ΔF(x_{1}^{\star})-F(x_{T+1}^{\star})\leq F(x_{1}^{\star})-\inf_{x}F(x)\leq\Delta^{\prime}, it holds that

min1kTLfxk𝐩𝐫𝐨𝐱1Lfh(Ab)(xkf(xk)2Lf)5LfΔT.\min_{1\leq k\leq T}L_{f}\left\|x_{k}-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}-\frac{\nabla f(x_{k})}{2L_{f}}\right)\right\|\leq\sqrt{\frac{5L_{f}\Delta^{\prime}}{T}}.

Furthermore, under the assumption of x0domFx_{0}\in\mathrm{dom}F, it holds F(x1)F(x0)F(x_{1}^{\star})\leq F(x_{0}) and accordingly F(x1)infxF(x)F(x0)infxF(x)ΔF(x_{1}^{\star})-\inf_{x}F(x)\leq F(x_{0})-\inf_{x}F(x)\leq\Delta. This completes the proof. ∎

In Theorem 3.2, we present the first inequality by incorporating the definition of Δ\Delta^{\prime} specifically for the scenario where x0domFx_{0}\notin\mathrm{dom}F. This is necessary because in such cases, F(x0)infxF(x)F(x_{0})-\inf_{x}F(x) can be infinite, as exemplified by h(x)h(x) when it is an indicator function. Consequently, it becomes impossible to find a finite Δ\Delta that satisfies F(x0)infxF(x)ΔF(x_{0})-\inf_{x}F(x)\leq\Delta. By introducing the definition of Δ\Delta^{\prime}, we ensure the existence of a finite Δ\Delta^{\prime} and thus establish a well-defined result.

Corollary 2

Under the same assumption and same choice of δk\delta_{k} in Theorem 3.2, in order to find an approximate solution xTx_{T} satisfying SubOpt𝖭𝖢(xT)ϵ\operatorname{SubOpt}_{\mathsf{NC}}(x_{T})\leq\epsilon, the total number of gradient evaluations for Algorithm 3 is bounded by

O~(κALfΔϵ2).\tilde{{O}}\left(\frac{\kappa_{A}L_{f}\Delta^{\prime}}{\epsilon^{2}}\right).

Furthermore, if x0domFx_{0}\in\mathrm{dom}F, the total number is bounded by O~(κALfΔϵ2)\tilde{{O}}\left(\frac{\kappa_{A}L_{f}\Delta}{\epsilon^{2}}\right).

Proof

By Theorem 3.2, to reach the expected precision, we need O(LfΔϵ2)O\left(\frac{L_{f}\Delta^{\prime}}{\epsilon^{2}}\right) outer iterations. For each 1kT1\leq k\leq T, the function f(x)+Lfxxk12f(x)+L_{f}\|x-x_{k-1}\|^{2} is 2Lf2L_{f}-strongly-convex and 3Lf3L_{f}-smooth, and hence its condition number is O(1)O(1). Hence, the number of gradient evaluations in the kk-th inner iteration with Algorithm 2 is O~(κAlog(1/δk))\tilde{O}(\kappa_{A}\log(1/\delta_{k})) by Corollary 1. Combining the complexities of inner and outer loops, we obtain the O~(κALfΔϵ2)\tilde{{O}}\left(\frac{\kappa_{A}L_{f}\Delta^{\prime}}{\epsilon^{2}}\right) overall complexity. For the case x0domFx_{0}\in\mathrm{dom}F, the complexity can be derived similarly.∎

3.4 Convex case

For any given x0x_{0} and ϵ>0\epsilon>0, we construct the following auxiliary problem:

minxf(x)+h(Axb)+ϵ2D2xx02.\min_{x}\ f(x)+h(Ax-b)+\frac{\epsilon}{2D^{2}}\|x-x_{0}\|^{2}. (19)

The smooth part f(x)+ϵ2D2xx02f(x)+\frac{\epsilon}{2D^{2}}\|x-x_{0}\|^{2} is strongly convex and hence we can apply Algorithm 2 to solve the problem. The following corollary illustrates that the approximate solution of (19) is also an approximate solution of the original convex problem and the overall complexity is also optimal.
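
A sketch of this reduction, again treating the strongly convex solver of Algorithm 2 as a black box solve_sc with an assumed interface that handles h(Ax-b) internally; the tolerance delta is the one chosen in the proof of Corollary 3 below, and all names are illustrative.

def solve_convex_case(x0, grad_f, Lf, D, eps, delta, solve_sc):
    # Solve the perturbed problem (19): f(x) + h(Ax - b) + eps/(2 D^2) * ||x - x0||^2.
    mu = eps / D**2                                   # strong convexity of the added quadratic
    grad_pert = lambda x: grad_f(x) + mu * (x - x0)   # gradient of the perturbed smooth part
    # the perturbed smooth part is (Lf + mu)-smooth and mu-strongly convex
    return solve_sc(grad_pert, L=Lf + mu, mu=mu, x_init=x0, tol=delta)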

Corollary 3

Suppose that f(x) is convex, the condition number of A is \kappa_{A}, and \|x_{0}-x^{\star}\|\leq D. For any given 0<\rho<+\infty, Algorithm 2 can be applied to problem (19) to output an approximate solution \hat{x} satisfying \operatorname{SubOpt}_{\mathsf{C}}(\hat{x})\leq\epsilon, and the total number of gradient evaluations is bounded by

O~(κALfDϵ),\tilde{O}\left(\frac{\kappa_{A}\sqrt{L_{f}}D}{\sqrt{\epsilon}}\right),

where O~\tilde{O} also hides the logarithmic dependence on ρ\rho.

Proof

Denote the exact solution of (19) as x_{\epsilon}^{\star}. We apply Algorithm 2 to (19) and compute a point \hat{x} such that \left\|\hat{x}-x_{\epsilon}^{\star}\right\|\leq\delta, where \delta will be specified later in the proof.

By the optimality of xϵx_{\epsilon}^{\star}, it holds that

F(xϵ)+ϵ2D2xϵx02F(x)+ϵ2D2xx02.\displaystyle F(x_{\epsilon}^{\star})+\frac{\epsilon}{2D^{2}}\left\|x_{\epsilon}^{\star}-x_{0}\right\|^{2}\leq F(x^{\star})+\frac{\epsilon}{2D^{2}}\left\|x^{\star}-x_{0}\right\|^{2}.

In particular, we have F(xϵ)F(x)+ϵ/2F(x_{\epsilon}^{\star})\leq F(x^{\star})+\epsilon/2, and by the fact that F(x)F(xϵ)F(x^{\star})\leq F(x_{\epsilon}^{\star}), we know xϵx0xx0D\left\|x_{\epsilon}^{\star}-x_{0}\right\|\leq\left\|x^{\star}-x_{0}\right\|\leq D.

On the other hand, we have

f(x^)\displaystyle f(\hat{x})\leq f(xϵ)+f(xϵ),x^xϵ+Lf2x^xϵ2\displaystyle~{}f(x_{\epsilon}^{\star})+\left\langle{\nabla f(x_{\epsilon}^{\star})},{\hat{x}-x_{\epsilon}^{\star}}\right\rangle+\frac{L_{f}}{2}\left\|\hat{x}-x_{\epsilon}^{\star}\right\|^{2}
\displaystyle\leq f(xϵ)+(f(x0)+LfD)x^xϵ+Lf2x^xϵ2,\displaystyle~{}f(x_{\epsilon}^{\star})+\left(\left\|\nabla f(x_{0})\right\|+L_{f}D\right)\left\|\hat{x}-x_{\epsilon}^{\star}\right\|+\frac{L_{f}}{2}\left\|\hat{x}-x_{\epsilon}^{\star}\right\|^{2},

where the last inequality follows from f(xϵ)f(x0)Lfxϵx0LfD\left\|\nabla f(x_{\epsilon}^{\star})-\nabla f(x_{0})\right\|\leq L_{f}\left\|x_{\epsilon}^{\star}-x_{0}\right\|\leq L_{f}D. Further, since hρ()h_{\rho}(\cdot) is ρ\rho-Lipschitz continuous and hρ()h()h_{\rho}(\cdot)\leq h(\cdot), it holds

hρ(Ax^b)hρ(Axϵb)+ρAx^Axϵh(Axϵb)+ρLAx^xϵ.h_{\rho}(A\hat{x}-b)\leq h_{\rho}(Ax_{\epsilon}^{\star}-b)+\rho\left\|A\hat{x}-Ax_{\epsilon}^{\star}\right\|\leq h(Ax_{\epsilon}^{\star}-b)+\rho L_{A}\left\|\hat{x}-x_{\epsilon}^{\star}\right\|.

Denote Cρ=f(x0)+LfD+ρLAC_{\rho}=\left\|\nabla f(x_{0})\right\|+L_{f}D+\rho L_{A}. Combining the above two inequalities yields

f(x^)+hρ(Ax^b)\displaystyle f(\hat{x})+h_{\rho}(A\hat{x}-b)\leq f(xϵ)+h(Axϵb)+Cρx^xϵ+Lf2x^xϵ2\displaystyle~{}f(x_{\epsilon}^{\star})+h(Ax_{\epsilon}^{\star}-b)+C_{\rho}\left\|\hat{x}-x_{\epsilon}^{\star}\right\|+\frac{L_{f}}{2}\left\|\hat{x}-x_{\epsilon}^{\star}\right\|^{2}
\displaystyle\leq F(xϵ)+Cρδ+Lfδ22\displaystyle~{}F(x_{\epsilon}^{\star})+C_{\rho}\delta+\frac{L_{f}\delta^{2}}{2}
\displaystyle\leq F(x)+ϵ2+Cρδ+Lfδ22.\displaystyle~{}F(x^{\star})+\frac{\epsilon}{2}+C_{\rho}\delta+\frac{L_{f}\delta^{2}}{2}.

Therefore, we can set

δ=min{ϵ3Cρ,ϵ3Lf}\displaystyle\delta=\min\left\{\frac{\epsilon}{3C_{\rho}},\sqrt{\frac{\epsilon}{3L_{f}}}\right\}

to ensure that SubOpt𝖢(x^)ϵ\operatorname{SubOpt}_{\mathsf{C}}(\hat{x})\leq\epsilon. Notice that the function f(x)+ϵ2D2xx02f(x)+\frac{\epsilon}{2D^{2}}\left\|x-x_{0}\right\|^{2} is (Lf+ϵ/D2)(L_{f}+\epsilon/D^{2})-smooth and (ϵ/D2)(\epsilon/D^{2})-strongly convex. Therefore, according to Corollary 1, the required number of gradient evaluations is O~(κALfD2ϵ+1)\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}+1}\right). ∎

Note that specific forms of h_{\rho} are given in Lemma 2 and Remarks 1 and 2. In the following, we give the complexity results for these specific problems.

Corollary 4

For the conic inequality constrained problem (3) and any fixed \rho>0, in order to find an approximate solution x_{T} satisfying

|f(xT)f(x)|ϵ,𝒫𝒦(AxTb)ϵρ,\displaystyle|f(x_{T})-f(x^{\star})|\leq\epsilon,\qquad\|\mathcal{P}_{\mathcal{K}^{\circ}}(Ax_{T}-b)\|\leq\frac{\epsilon}{\rho},

the required number of gradient evaluations is \tilde{O}\left(\kappa_{A}D\sqrt{L_{f}/\epsilon}\right).

Corollary 4 implies that for conic inequality constrained convex problems (including linearly constrained convex problems), the constraint can be fulfilled to arbitrary accuracy without affecting the order of the complexity (up to log factor).

Corollary 5

When h(\cdot) is \rho-Lipschitz continuous (e.g., the norm regularized problem (4)), in order to find an approximate solution x_{T} satisfying F(x_{T})-\min_{x}F(x)\leq\epsilon, the required number of gradient evaluations is \tilde{O}\left(\kappa_{A}D\sqrt{L_{f}/\epsilon}\right).

4 Lower bounds

4.1 Problem classes and algorithm class

In this section, we aim to construct three hard instances for the strongly convex, convex, and non-convex cases, respectively. Let us first formally define the problem classes and the linear span first-order algorithm class. For simplicity of presentation, we construct the hard instances with x_{0}=0 and L_{A}=2.

Strongly convex problem class.  For positive constants Lfμf>0L_{f}\geq\mu_{f}>0, D,κA>0D,\kappa_{A}>0, the problem class SC(Lf,μf,κA,D)\mathcal{F}_{\mathrm{SC}}(L_{f},\mu_{f},\kappa_{A},D) includes problems in which f(x)f(x) is LfL_{f}-smooth and μf\mu_{f}-strongly convex, x0xD\|x_{0}-x^{\star}\|\leq D, and the condition number of AA is upper bounded by κA\kappa_{A}.

Non-convex problem class.  For positive constants Lf,κA,Δ>0L_{f},\kappa_{A},\Delta>0 and x0domFx_{0}\in\mathrm{dom}F, the problem class NC(Lf,Δ,κA)\mathcal{F}_{\mathrm{NC}}(L_{f},\Delta,\kappa_{A}) includes problems where f(x)f(x) is LfL_{f}-smooth, F(x0)F(x)ΔF(x_{0})-F(x^{\star})\leq\Delta and the condition number of AA is upper bounded by κA\kappa_{A}.

Convex problem class.  For positive constants Lf,D,κA>0L_{f},D,\kappa_{A}>0, the problem class C(Lf,κA,D)\mathcal{F}_{\mathrm{C}}(L_{f},\kappa_{A},D) includes problems in which f(x)f(x) is LfL_{f}-smooth, x0xD\|x_{0}-x^{\star}\|\leq D, and the condition number of AA is upper bounded by κA\kappa_{A}.

For the above three problem classes, we restrict our discussion to first-order linear span algorithms. The results can be extended to the general first-order algorithms without linear span structure by the orthogonal invariance trick proposed in carmon2020lower .

First-order linear span algorithms.  The iterate sequence {(xk,λk)}\{(x_{k},\lambda_{k})\} is generated such that (xk,λk)𝒮k+1x×𝒮k+1λ(x_{k},\lambda_{k})\in\mathcal{S}^{x}_{k+1}\times\mathcal{S}^{\lambda}_{k+1}. These subspaces are generated by starting with 𝒮0x=Span{x0}\mathcal{S}^{x}_{0}=\mathrm{Span}\{x_{0}\}, 𝒮0λ=Span{λ0}\mathcal{S}^{\lambda}_{0}=\mathrm{Span}\{\lambda_{0}\} and

\begin{split}\mathcal{S}^{x}_{k+1}&\mathrel{\mathop{:}}=\mathrm{Span}\left\{x_{i},\nabla f(\hat{x}_{i}),A^{\mathrm{T}}\hat{\lambda}_{i}:\forall\hat{x}_{i}\in\mathcal{S}_{i}^{x},\hat{\lambda}_{i}\in\mathcal{S}_{i}^{\lambda},0\leq i\leq k\right\},\\ \mathcal{S}^{\lambda}_{k+1}&\mathrel{\mathop{:}}=\mathrm{Span}\left\{\lambda_{i},\mathbf{prox}_{\eta_{i}h^{\star}}\left(\hat{\lambda}_{i}+\eta_{i}(A\hat{x}_{i}-b)\right):\forall\hat{x}_{i}\in\mathcal{S}_{i}^{x},\hat{\lambda}_{i}\in\mathcal{S}_{i}^{\lambda},0\leq i\leq k\right\}.\end{split}

If we assume that λ0=0\lambda_{0}=0, then for the linear equality constrained problem (2), 𝐩𝐫𝐨𝐱ηih(x)=x\mathbf{prox}_{\eta_{i}h^{\star}}(x)=x and the algorithm class degenerates into

xk+1𝒮k+1x=Span{xi,f(x^i),AT(Ax^ib),x^i𝒮ix,0ik}.x_{k+1}\in\mathcal{S}^{x}_{k+1}=\mathrm{Span}\left\{x_{i},\nabla f(\hat{x}_{i}),A^{\mathrm{T}}(A\hat{x}_{i}-b),~{}\forall\hat{x}_{i}\in\mathcal{S}_{i}^{x},0\leq i\leq k\right\}.

We can further assume x0=0x_{0}=0 without loss of generality, otherwise, we can consider the shifted problem minxF(xx0)\min_{x}F(x-x_{0}).

Note that for a first-order linear span algorithm, it is not necessary to use the current gradient in each iteration. Instead, it can use any combination of points from the historical search space. This makes the algorithm class general enough to cover diverse iteration schemes. To give some specific examples, we present the following single-loop and double-loop algorithms covered by the considered algorithm class.

Example 1 (Single-loop algorithms)

Consider problem (2) with h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot). The Chambolle-Pock method chambolle2011first , the OGDA method mokhtari2020convergence and the linearized ALM xu2021first update the iterates by the following rules

{xk+1=xkη1(f(xk)+ATλk)λk+1=λk+η2(2Axk+1Axkb)\begin{split}\left\{\begin{array}[]{l}x_{k+1}=x_{k}-\eta_{1}\left(\nabla f(x_{k})+A^{\mathrm{T}}\lambda_{k}\right)\\ \lambda_{k+1}=\lambda_{k}+\eta_{2}(2Ax_{k+1}-Ax_{k}-b)\end{array}\right.\end{split} (Chambolle-Pock)
{xk+1=xkη1(2f(xk)f(xk1)+AT(2λkλk1))λk+1=λk+η2(2AxkAxk1b)\begin{split}\left\{\begin{array}[]{l}x_{k+1}=x_{k}-\eta_{1}\left(2\nabla f(x_{k})-\nabla f(x_{k-1})+A^{\mathrm{T}}(2\lambda_{k}-\lambda_{k-1})\right)\\ \lambda_{k+1}=\lambda_{k}+\eta_{2}(2Ax_{k}-Ax_{k-1}-b)\end{array}\right.\end{split} (OGDA)
{xk+1=xkη1(f(xk)+ATλk+ρAT(Axkb))λk+1=λk+η2(Axk+1b)\begin{split}\left\{\begin{array}[]{l}x_{k+1}=x_{k}-\eta_{1}\left(\nabla f(x_{k})+A^{\mathrm{T}}\lambda_{k}+\rho A^{\mathrm{T}}(Ax_{k}-b)\right)\\ \lambda_{k+1}=\lambda_{k}+\eta_{2}(Ax_{k+1}-b)\end{array}\right.\end{split} (Linearized ALM)

where ρ>0\rho>0 is a penalty factor.

For problem class SC(Lf,μf,κA)\mathcal{F}_{SC}(L_{f},\mu_{f},\kappa_{A}), a unified analysis on the above three methods was provided in zhu2022unified , and an O((κf+κA2)log(1ϵ))O\left((\kappa_{f}+\kappa_{A}^{2})\log\left(\frac{1}{\epsilon}\right)\right) complexity is achieved.
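To make the single-loop updates above concrete, the following minimal Python sketch runs the linearized ALM iteration on a toy instance of (2) with a strongly convex quadratic objective. The random data, the step sizes and the penalty parameter are illustrative choices of this sketch only, not the tuned parameters analyzed in zhu2022unified ; the Chambolle-Pock and OGDA iterations fit the same template with the two update lines changed accordingly.

```python
import numpy as np

# Toy instance of problem (2): f(x) = 0.5*||x - c||^2 (1-smooth, 1-strongly convex), Ax = b.
rng = np.random.default_rng(0)
n, m = 20, 5
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)                  # feasible right-hand side
c = rng.standard_normal(n)
grad_f = lambda x: x - c

def linearized_alm(T=5000, rho=1.0):
    eta1 = 1.0 / (1.0 + rho * np.linalg.norm(A, 2) ** 2)   # primal step size
    eta2 = rho                                              # dual step size
    x, lam = np.zeros(n), np.zeros(m)
    for _ in range(T):
        # primal step: one gradient step on the augmented Lagrangian in x
        x = x - eta1 * (grad_f(x) + A.T @ lam + rho * A.T @ (A @ x - b))
        # dual ascent step on the updated constraint residual
        lam = lam + eta2 * (A @ x - b)
    return x, lam

x, lam = linearized_alm()
print("constraint violation:", np.linalg.norm(A @ x - b))
print("stationarity residual:", np.linalg.norm(grad_f(x) + A.T @ lam))
```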

Example 2 (Double-loop algorithm xu2021iteration )

Consider problem (2) with h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot). Let ρ(x,λ)=f(x)+λT(Axb)+ρ2Axb2\mathcal{L}_{\rho}(x,\lambda)=f(x)+\lambda^{\mathrm{T}}(Ax-b)+\frac{\rho}{2}\|Ax-b\|^{2} be the augmented Lagrangian function. ALM generates iterates by

{xk+1argminxρ(x,λk)λk+1=λk+ρ(Axk+1b)\left\{\begin{array}[]{l}x_{k+1}\approx\operatorname*{arg\,min}_{x}\mathcal{L}_{\rho}(x,\lambda_{k})\\ \lambda_{k+1}=\lambda_{k}+\rho(Ax_{k+1}-b)\end{array}\right.

where the subproblem is solved by an inner loop of Nesterov’s AGD method. An O(ϵ1)O(\epsilon^{-1}) complexity for convex problems and an O(ϵ12)O(\epsilon^{-\frac{1}{2}}) complexity for strongly convex problems are derived in xu2021iteration .

Remark 3

It can be checked that all three algorithms proposed in the upper bound section belong to the defined first order linear span algorithm class for general hh.

4.2 The construction of hard instance

In this section, we construct hard instances for first-order linear span algorithms. Specifically, we consider the linear equality constrained problem (2) with h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot). For positive integers NN and dd, we define the following problem

minx2Ndf0(x):=G(x[1],x[N+1])++G(x[N],x[2N]),s.t.x[1]=x[2]==x[2N],\displaystyle\begin{aligned} \min_{x\in\mathbb{R}^{2Nd}}&~{}f_{0}(x)\mathrel{\mathop{:}}=G(x[1],x[N+1])+\cdots+G(x[N],x[2N]),\\ \,\textrm{s.t.}\,&~{}x[1]=x[2]=\cdots=x[2N],\end{aligned} (20)

where x[1],,x[2N]dx[1],\cdots,x[2N]\in\mathbb{R}^{d} and x2Ndx\in\mathbb{R}^{2Nd} is the vector that stacks x[i]x[i] together in order, and the component function G(u,v):d×dG(u,v):\mathbb{R}^{d}\times\mathbb{R}^{d}\mapsto\mathbb{R} is a smooth function to be determined later. To ensure that f0(x)f_{0}(x) satisfies the assumptions of different problem classes, we will construct various formulations of G(u,v)G(u,v) in the strongly convex, convex, and non-convex cases, respectively. Additionally, we require G(u,v)G(u,v) to satisfy the following assumption.

Assumption 4.1

For any i0i\geq 0, it holds

  1.

    If supp{u}[i+1],supp{v}[i]\operatorname{supp}\left\{u\right\}\subset[i+1],\operatorname{supp}\left\{v\right\}\subset[i], then supp{uG(u,v)}[i+1]\operatorname{supp}\left\{\nabla_{u}G(u,v)\right\}\subset[i+1] and supp{vG(u,v)}[i]\operatorname{supp}\left\{\nabla_{v}G(u,v)\right\}\subset[i].

  2.

    If supp{u}[i],supp{v}[i]\operatorname{supp}\left\{u\right\}\subset[i],\operatorname{supp}\left\{v\right\}\subset[i], then supp{uG(u,v)}[i+1]\operatorname{supp}\left\{\nabla_{u}G(u,v)\right\}\subset[i+1] and supp{vG(u,v)}[i]\operatorname{supp}\left\{\nabla_{v}G(u,v)\right\}\subset[i].

The constraint in (20) can be rewritten as Ax=0Ax=0 with

A=[IdIdIdIdIdId](2N1)d×2Nd.A=\left[\begin{array}[]{ccccc}I_{d}&-I_{d}&&&\\ &\ddots&\ddots&&\\ &&I_{d}&-I_{d}&\\ &&&I_{d}&-I_{d}\end{array}\right]\in\mathbb{R}^{(2N-1)d\times 2Nd}. (21)

Hence AAT(2N1)d×(2N1)dAA^{\mathrm{T}}\in\mathbb{R}^{(2N-1)d\times(2N-1)d} and ATA2Nd×2NdA^{\mathrm{T}}A\in\mathbb{R}^{2Nd\times 2Nd} can be computed as

AAT=[2IdIdId2IdIdId2IdIdId2Id],ATA=[IdIdId2IdIdId2IdIdIdId].AA^{\mathrm{T}}=\left[\begin{array}[]{ccccc}2I_{d}&-I_{d}&&&\\ -I_{d}&2I_{d}&-I_{d}&&\\ &\ddots&\ddots&\ddots&\\ &&-I_{d}&2I_{d}&-I_{d}\\ &&&-I_{d}&2I_{d}\end{array}\right],~{}A^{\mathrm{T}}A=\left[\begin{array}[]{ccccc}I_{d}&-I_{d}&&&\\ -I_{d}&2I_{d}&-I_{d}&&\\ &\ddots&\ddots&\ddots&\\ &&-I_{d}&2I_{d}&-I_{d}\\ &&&-I_{d}&I_{d}\end{array}\right]. (22)
Lemma 4

For matrix AA defined in (21), its condition number satisfies κA2N21\kappa_{A}\!\leq\!\sqrt{2N^{2}\!-\!1}.

Proof

Note that T=AATT=AA^{\mathrm{T}} is a block tridiagonal Toeplitz matrix and its eigenvalues are 2+2cos(πi2N),1i2N12+2\cos\left(\frac{\pi i}{2N}\right),1\leq i\leq 2N-1 (see noschese2013tridiagonal ). Accordingly, the condition number of TT satisfies

κT=2+2cos(π2N)2+2cos(π(2N1)2N)=1+cos(π2N)1cos(π2N)=1+21cos(π2N)12N21,\kappa_{T}=\frac{2+2\cos\left(\frac{\pi}{2N}\right)}{2+2\cos\left(\frac{\pi(2N-1)}{2N}\right)}=\frac{1+\cos\left(\frac{\pi}{2N}\right)}{1-\cos\left(\frac{\pi}{2N}\right)}=1+\frac{2}{\frac{1}{\cos\left(\frac{\pi}{2N}\right)}-1}\leq 2N^{2}-1,

where the last inequality is due to cos(πx2)1x2\cos\left(\frac{\pi x}{2}\right)\leq 1-x^{2}, x[0,1]\forall x\in\left[0,1\right]. Consequently, κA2N21\kappa_{A}\leq\sqrt{2N^{2}-1}. ∎
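The bound of Lemma 4 is easy to check numerically. The following sketch (assuming numpy is available) builds the matrix AA from (21) for a few values of NN and dd and compares its condition number with \sqrt{2N^{2}-1}.

```python
import numpy as np

def build_A(N, d):
    """Matrix A from (21): size (2N-1)d x 2Nd with blocks I_d and -I_d."""
    A = np.zeros(((2 * N - 1) * d, 2 * N * d))
    I = np.eye(d)
    for i in range(2 * N - 1):
        A[i * d:(i + 1) * d, i * d:(i + 1) * d] = I
        A[i * d:(i + 1) * d, (i + 1) * d:(i + 2) * d] = -I
    return A

for N, d in [(2, 3), (4, 2), (8, 5)]:
    s = np.linalg.svd(build_A(N, d), compute_uv=False)
    print(N, d, round(s.max() / s.min(), 4), "<=", round(np.sqrt(2 * N**2 - 1), 4))
```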

Next, we demonstrate the propagation of non-zero entries in this example. For a first-order linear span algorithm with b=0b=0, we have

𝒮k+1span{𝒮k{f0(x^k),ATAx^k}}.\mathcal{S}_{k+1}\subset\mathrm{span}\left\{\mathcal{S}_{k}\cup\left\{\nabla f_{0}(\hat{x}_{k}),A^{\mathrm{T}}A\hat{x}_{k}\right\}\right\}.

This implies that new non-zero entries are introduced either through G(x[i],x[N+i])\nabla G(x[i],x[N+i]), or through the action of ATAA^{\mathrm{T}}A on xx. As ATAA^{\mathrm{T}}A is a block tridiagonal matrix, each action of ATAA^{\mathrm{T}}A enables entries in x[i]x[i] to “communicate” with their neighboring vectors, thereby propagating the non-zero entries.

Figure 1 illustrates how non-zero entries propagate. Assume that the initial point is (x[i])j=0(x[i])_{j}=0 for all ii and jj. In the first iteration, we use Assumption 4.1 to observe that supp{uG(x[j],x[N+j])}[1],1jN\operatorname{supp}\left\{\nabla_{u}G(x[j],x[N+j])\right\}\subset[1],1\leq j\leq N and supp{vG(x[j],x[N+j])}=\operatorname{supp}\left\{\nabla_{v}G(x[j],x[N+j])\right\}=\emptyset, so it is only possible to have (x[1:N])10(x[1:N])_{1}\neq 0. In the second iteration, G(x[i],x[N+i])\nabla G(x[i],x[N+i]) does not introduce any new non-zero entries, but the action of ATAA^{\mathrm{T}}A on xx causes (x[N+1])1(x[N+1])_{1} to receive a non-zero entry from (x[N])1(x[N])_{1}. In the third iteration, we have uG(x[1],x[N+1])[2]\nabla_{u}G(x[1],x[N+1])\subset[2], which allows (x[1])2(x[1])_{2} to become non-zero. Additionally, ATAA^{\mathrm{T}}A propagates the non-zero entry in (x[N+1])1(x[N+1])_{1} to (x[N+2])1(x[N+2])_{1}. By repeating the above propagation mechanism, we can see that by the (N+1)(N+1)th iteration, both (x[1:2N])1(x[1:2N])_{1} and (x[1:N1])2(x[1:N-1])_{2} become nonzero. In the (N+2)(N+2)th iteration, (x[N])2(x[N])_{2} becomes nonzero through uG(x[N],x[2N])[2]\nabla_{u}G(x[N],x[2N])\subset[2]. We can consider iterations 22 to N+2N+2 (which consist of N+1N+1 iterations) as one complete round of iterations. By repeating this process, each round of iteration can convert up to 2N2N elements to nonzero. After i2i-2 rounds of iteration, specifically at the ((i1)(N+1)+1)((i-1)(N+1)+1)th iteration, (x[1:2N])1:i1(x[1:2N])_{1:i-1} and (x[1:N])i(x[1:N])_{i} become possibly nonzero.

Figure 1: Propagation of nonzero entries. In this figure, the pair (i,j)(i,j) represents the entry (x[i])j(x[i])_{j}. The propagation is indicated by blue arrows when passing through f0(x)\nabla f_{0}(x), and by orange arrows when passing through ATAxA^{\mathrm{T}}Ax. The iteration number is placed above each arrow.

Based on the above procedure, it is natural to obtain the following Lemma.

Lemma 5

For k=(i1)(N+1)+jk=(i-1)(N+1)+j with 1id1,1jN+11\leq i\leq d-1,1\leq j\leq N+1, we have

supp{xk}{(1:2N,1:i1)(1:N+j1,i)(1:j2,i+1)},\displaystyle\operatorname{supp}\left\{x_{k}\right\}\subset\left\{(1:2N,1:i-1)\cup(1:N+j-1,i)\cup(1:j-2,i+1)\right\},

where the pair (i,j)(i,j) represents the entry (x[i])j(x[i])_{j}. Therefore, for any k>0k>0, let K=k2N+1+1K=\left\lfloor\frac{k-2}{N+1}\right\rfloor+1, then (x[i])j=0(x[i])_{j}=0 for any i,ji,j satisfying N+1i2N,K+1jdN+1\leq i\leq 2N,K+1\leq j\leq d.

It is well known that the complexity lower bound of the strongly convex case for an unconstrained smooth problem is Ω(κflog(1/ϵ))\Omega(\sqrt{\kappa_{f}}\log(1/\epsilon)) nesterov2018lectures , the lower bound of the convex case is Ω(LfD/ϵ)\Omega\left(\sqrt{L_{f}}D/\sqrt{\epsilon}\right) nesterov2018lectures and the lower bound of non-convex case is Ω(LfΔ/ϵ2)\Omega(L_{f}\Delta/\epsilon^{2}) carmon2020lower . These papers construct hard instances with zero-chaining property, where only one possible non-zero entry is added to the decision variable at each iteration (see Chapter 2 of nesterov2018lectures ). In contrast, Lemma 5 indicates that in order to add a non-zero entry to each x[i]x[i], it is necessary to take at least N+1N+1 iterations, that is, Ω(κA)\Omega(\kappa_{A}) iterations by Lemma 4. Therefore, our complexity lower bounds need to be multiplied by an additional factor of κA\kappa_{A} on top of the lower bounds of unconstrained problems. This intuitively yields our results listed in Table 1.
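The propagation mechanism behind Lemma 5 can also be simulated symbolically. The sketch below tracks an over-approximation of supp{x[i]} for each block when, in every iteration, one gradient of f0 and one multiplication by AᵀA are applied to the whole current search space. The support rules are specialized to the chain-like component (23) (they are an assumption of this sketch, not part of the general algorithm class) and reproduce the pattern of Figure 1.

```python
def propagate(N, d, iters):
    """Over-approximate supp(x[i]) (sets of coordinate indices) per block."""
    supp = [set() for _ in range(2 * N)]            # blocks x[1..2N], 0-indexed
    for _ in range(iters):
        grad_u, grad_v = [], []
        for i in range(N):                          # pair (u, v) = (x[i+1], x[N+i+1])
            u, v = supp[i], supp[N + i]
            # for the chain (23): supp(grad_u G) is contained in {1} | supp(u) | (supp(v)+1)
            grad_u.append({1} | u | {j + 1 for j in v if j + 1 <= d})
            # supp(grad_v G) is contained in supp(v) | (supp(u)-1)
            grad_v.append(v | {j - 1 for j in u if j - 1 >= 1})
        # A^T A is block tridiagonal: block i only mixes with its neighbors
        ata = [supp[max(i - 1, 0)] | supp[i] | supp[min(i + 1, 2 * N - 1)]
               for i in range(2 * N)]
        for i in range(N):
            supp[i] |= grad_u[i] | ata[i]
            supp[N + i] |= grad_v[i] | ata[N + i]
    return supp

N, d = 4, 6
for k in (1, 2, N + 1, N + 2, 2 * (N + 1)):
    deepest = [max(b) if b else 0 for b in propagate(N, d, k)]
    print(k, deepest)    # deepest possibly-nonzero coordinate in each block after k iterations
```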

4.3 Strongly convex case

Let us construct our hard problem instance based on formulation (20). For any given positive parameters Lfμf>0,α>0L_{f}\geq\mu_{f}>0,\alpha>0 and a positive integer dd, we define the function G(,):d×dG(\cdot,\cdot):\mathbb{R}^{d}\times\mathbb{R}^{d}\mapsto\mathbb{R} as

G(u,v)=Lfμf4G0(u,v)+μf2(u2+v2),G(u,v)=\frac{L_{f}-\mu_{f}}{4}G_{0}(u,v)+\frac{\mu_{f}}{2}\left(\|u\|^{2}+\|v\|^{2}\right), (23)

where

G0(u,v):=(αu1)2+(v1u2)2+(v2u3)2++(vd1ud)2.G_{0}(u,v)\mathrel{\mathop{:}}=\left(\alpha-u_{1}\right)^{2}+\left(v_{1}-u_{2}\right)^{2}+\left(v_{2}-u_{3}\right)^{2}+\cdots+\left(v_{d-1}-u_{d}\right)^{2}.

The construction is based on Nesterov’s well-known “chain-like” quadratic function nesterov2018lectures . In the following, we give some properties of the constructed problem instance.

Lemma 6

The problem defined in (20) has the following properties.

  1.

    f0(x)f_{0}(x) is LfL_{f}-smooth and μf\mu_{f}-strongly convex.

  2.

    Denote q=κf1κf+1q=\frac{\sqrt{\kappa_{f}}-1}{\sqrt{\kappa_{f}}+1}. Then the optimal solution xx^{\star} of (20) is given by

    x[1]=x[2]==x[2N]=𝕩,x^{\star}[1]=x^{\star}[2]=\cdots=x^{\star}[2N]=\mathbbm{x}^{\star}, (24)

    where 𝕩d\mathbbm{x}^{\star}\in\mathbb{R}^{d} is given by

    𝕩i=αqi+q2d+1i1+q2d+1,i=1,,d.\mathbbm{x}^{\star}_{i}=\alpha\cdot\frac{q^{i}+q^{2d+1-i}}{1+q^{2d+1}},\qquad i=1,\cdots,d. (25)

    Moreover, for any K0K\geq 0, we have i=K+1d(𝕩i)2q2KdKd𝕩2\sum_{i=K+1}^{d}(\mathbbm{x}^{\star}_{i})^{2}\geq q^{2K}\cdot\frac{d-K}{d}\|\mathbbm{x}^{\star}\|^{2}.

Proof

Part 1: Let us fix the vectors u,v,ω,νdu,v,\omega,\nu\in\mathbb{R}^{d} with (ω,ν)=1\|(\omega,\nu)\|=1, and define h(θ):=G0(u+θω,v+θν)h(\theta)\mathrel{\mathop{:}}=G_{0}(u+\theta\omega,v+\theta\nu). If we take v0=0v_{0}=0 and ν0=0\nu_{0}=0, it holds

h′′(0)=\displaystyle h^{\prime\prime}(0)= i=1d2G0(u,v)ui2ωi2+i=0d12G0(u,v)vi2νi2+2i=1d2G0(u,v)uivi1ωiνi1\displaystyle\sum_{i=1}^{d}\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}^{2}}\omega_{i}^{2}+\sum_{i=0}^{d-1}\frac{\partial^{2}G_{0}(u,v)}{\partial v_{i}^{2}}\nu_{i}^{2}+2\sum_{i=1}^{d}\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}\partial v_{i-1}}\omega_{i}\nu_{i-1}
=\displaystyle= 2i=1dωi2+2i=0d1νi24i=1dωiνi1.\displaystyle 2\sum_{i=1}^{d}\omega_{i}^{2}+2\sum_{i=0}^{d-1}\nu_{i}^{2}-4\sum_{i=1}^{d}\omega_{i}\nu_{i-1}.

On the one hand, due to the Cauchy–Schwarz inequality, we have h′′(0)0h^{\prime\prime}(0)\geq 0. On the other hand,

h′′(0)2i=1dωi2+2i=0dνi2+2i=1d(ωi2+νi12)4,h^{\prime\prime}(0)\leq 2\sum_{i=1}^{d}\omega_{i}^{2}+2\sum_{i=0}^{d}\nu_{i}^{2}+2\sum_{i=1}^{d}(\omega_{i}^{2}+\nu_{i-1}^{2})\leq 4,

where the last inequality follows from (ω,ν)=1\|(\omega,\nu)\|=1. Therefore, G0(u,v)G_{0}(u,v) is convex and 44-smooth, hence G(u,v)G(u,v) is μf\mu_{f}-strongly convex and LfL_{f}-smooth, which implies f0(x)f_{0}(x) is also μf\mu_{f}-strongly convex and LfL_{f}-smooth.

Part 2: For any xx satisfying the constraint in (20), we have x=(v,v,,v)x=(v,v,\cdots,v) and f0(x)=NG(v,v)f_{0}(x)=NG(v,v). Thus, we only need to verify that 𝕩\mathbbm{x}^{\star} is the (unique) minimum point of the function G(v,v)G(v,v). By the optimality condition vG(v,v)=0\nabla_{v}G(v,v)=0, we have

\left\{\begin{array}[]{l}(2+\beta)v_{1}^{\star}-v_{2}^{\star}=\alpha,\\ -v_{1}^{\star}+(2+\beta)v_{2}^{\star}-v_{3}^{\star}=0,\\ \qquad\qquad\vdots\\ -v_{d-2}^{\star}+(2+\beta)v_{d-1}^{\star}-v_{d}^{\star}=0,\\ -v_{d-1}^{\star}+(1+\beta)v_{d}^{\star}=0,\end{array}\right.

where we denote β:=4μfLfμf\beta\mathrel{\mathop{:}}=\frac{4\mu_{f}}{L_{f}-\mu_{f}}. Note that q=12((2+β)(2+β)24)q=\frac{1}{2}\left((2+\beta)-\sqrt{(2+\beta)^{2}-4}\right) is the smallest root of the quadratic equation λ2(2+β)λ+1=0\lambda^{2}-(2+\beta)\lambda+1=0. By a direct calculation, we can check that 𝕩\mathbbm{x}^{\star} satisfies the dd equations, and hence 𝕩\mathbbm{x}^{\star} is the minimum of the function G(v,v)G(v,v). Lastly, it holds

i=K+1d(𝕩i)2=\displaystyle\sum_{i=K+1}^{d}(\mathbbm{x}_{i}^{\star})^{2}= α2i=K+1dq2i+q4d+22i+2q2d+1(1+q2d+1)2\displaystyle\alpha^{2}\sum_{i=K+1}^{d}\frac{q^{2i}+q^{4d+2-2i}+2q^{2d+1}}{(1+q^{2d+1})^{2}}
=(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{=}} α2(1+q2d+1)2(2q2d+1(dK)+i=K+12dKq2i)\displaystyle\frac{\alpha^{2}}{(1+q^{2d+1})^{2}}\left(2q^{2d+1}(d-K)+\sum_{i=K+1}^{2d-K}q^{2i}\right)
(ii)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\geq}} α2(1+q2d+1)2(2q2d+1(dK)+2d2K2di=12dq2i+2K)\displaystyle\frac{\alpha^{2}}{(1+q^{2d+1})^{2}}\left(2q^{2d+1}(d-K)+\frac{2d-2K}{2d}\sum_{i=1}^{2d}q^{2i+2K}\right)
\displaystyle\geq α2(dK)q2Kd(1+q2d+1)2(2dq2d+1+i=12dq2i)\displaystyle\frac{\alpha^{2}(d-K)q^{2K}}{d(1+q^{2d+1})^{2}}\left(2dq^{2d+1}+\sum_{i=1}^{2d}q^{2i}\right)
=\displaystyle= q2KdKd𝕩2,\displaystyle q^{2K}\cdot\frac{d-K}{d}\|\mathbbm{x}^{\star}\|^{2},

where (i)(i) follows from i=K+1d(q2i+q4d+22i)=i=K+1dq2i+i=d+12dKq2i=i=K+12dKq2i\sum_{i=K+1}^{d}(q^{2i}+q^{4d+2-2i})=\sum_{i=K+1}^{d}q^{2i}+\sum_{i=d+1}^{2d-K}q^{2i}=\sum_{i=K+1}^{2d-K}q^{2i}, (ii)(ii) holds because q<1q<1 and 12d2Ki=K+12dKq2i12di=K+12d+Kq2i=12di=12dq2i+2K\frac{1}{2d-2K}\sum_{i=K+1}^{2d-K}q^{2i}\geq\frac{1}{2d}\sum_{i=K+1}^{2d+K}q^{2i}=\frac{1}{2d}\sum_{i=1}^{2d}q^{2i+2K}.∎
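As a numerical sanity check of the closed form (25), the sketch below solves the tridiagonal optimality system from the proof directly and compares the result with the formula for 𝕩⋆; the values of Lf, μf, α and d are arbitrary test choices of this sketch.

```python
import numpy as np

L_f, mu_f, alpha, d = 10.0, 1.0, 1.3, 8
beta = 4 * mu_f / (L_f - mu_f)

# optimality system of v -> G(v, v): tridiagonal with (2+beta) on the diagonal
M = np.zeros((d, d))
rhs = np.zeros(d)
for i in range(d):
    M[i, i] = 2 + beta
    if i > 0:
        M[i, i - 1] = -1
    if i < d - 1:
        M[i, i + 1] = -1
M[d - 1, d - 1] = 1 + beta          # last equation: -v_{d-1} + (1+beta) v_d = 0
rhs[0] = alpha
v_star = np.linalg.solve(M, rhs)

kappa_f = L_f / mu_f
q = (np.sqrt(kappa_f) - 1) / (np.sqrt(kappa_f) + 1)
i = np.arange(1, d + 1)
x_star = alpha * (q**i + q**(2 * d + 1 - i)) / (1 + q**(2 * d + 1))   # formula (25)
print(np.max(np.abs(v_star - x_star)))   # expect agreement up to machine precision
```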

Now we are ready to give our lower bound result for the strongly convex case.

Theorem 4.2

Let parameters Lf>μf>0,κA1L_{f}>\mu_{f}>0,\kappa_{A}\geq 1 be given. For any integer kκA+12k\geq\left\lfloor\frac{\kappa_{A}+1}{2}\right\rfloor, there exists an instance in SC(Lf,μf,κA,D)\mathcal{F}_{\mathrm{SC}}(L_{f},\mu_{f},\kappa_{A},D) of form (20), with component function GG defined in (23), N=κA+12,K=k2N+1+1,d=2KN=\left\lfloor\frac{\kappa_{A}+1}{2}\right\rfloor,K=\left\lfloor\frac{k-2}{N+1}\right\rfloor+1,d=2K, and α\alpha suitably chosen. For this problem, the iterates generated by any first-order algorithm in 𝒜\mathcal{A} satisfy

SubOpt𝖲𝖢(xk)12(κf1κf+1)2kNx0x.\operatorname{SubOpt}_{\mathsf{SC}}(x_{k})\geq\frac{1}{2}\left(\frac{\sqrt{\kappa_{f}}-1}{\sqrt{\kappa_{f}}+1}\right)^{\frac{2k}{N}}\|x_{0}-x^{\star}\|.
Proof

By Part 1 of Lemma 6, we know f0(x)f_{0}(x) is LfL_{f}-smooth and μf\mu_{f}-strongly convex. The condition number of AA is not greater than 2N21(κA+1)221κA\sqrt{2N^{2}-1}\leq\sqrt{\frac{(\kappa_{A}+1)^{2}}{2}-1}\leq\kappa_{A}, and α\alpha can be suitably chosen so that x0x=D\|x_{0}-x^{\star}\|=D. Thus the instance we construct belongs to the problem class SC(Lf,μf,κA,D)\mathcal{F}_{\mathrm{SC}}(L_{f},\mu_{f},\kappa_{A},D). According to Lemma 5 and our choice of KK, we have (xk[j])s=0(x_{k}[j])_{s}=0 for any N+1j2NN+1\leq j\leq 2N and K+1sdK+1\leq s\leq d. Therefore,

xkx2N×s=K+1d(𝕩s)2N×q2KdKd𝕩2,\left\|x_{k}-x^{\star}\right\|^{2}\geq N\times\sum_{s=K+1}^{d}\left(\mathbbm{x}_{s}^{\star}\right)^{2}\geq N\times q^{2K}\cdot\frac{d-K}{d}\left\|\mathbbm{x}^{\star}\right\|^{2},

where the last inequality comes from Part 2 of Lemma 6. Notice that x0x2=2N𝕩2\left\|x_{0}-x^{\star}\right\|^{2}=2N\left\|\mathbbm{x}^{\star}\right\|^{2} and K=d2K=\frac{d}{2}. Substituting them into the above inequality yields

xkx2q2K4x0x2.\left\|x_{k}-x^{\star}\right\|^{2}\geq\frac{q^{2K}}{4}\left\|x_{0}-x^{\star}\right\|^{2}. (26)

Plugging in the definition q=κf1κf+1q=\frac{\sqrt{\kappa_{f}}-1}{\sqrt{\kappa_{f}}+1} and the fact that K2kNK\leq\frac{2k}{N} completes the proof. ∎

Corollary 6

For any first-order algorithm in 𝒜\mathcal{A}, parameters κf2\kappa_{f}\geq 2 and 0<ϵD200<\epsilon\leq\frac{D}{20}, there exists a problem instance in SC(Lf,μf,κA,D)\mathcal{F}_{\mathrm{SC}}(L_{f},\mu_{f},\kappa_{A},D) such that at least

Ω(κAκflog(Dϵ))\Omega\left(\kappa_{A}\sqrt{\kappa_{f}}\log\left(\frac{D}{\epsilon}\right)\right)

iterations are required in order to find an iterate xkx_{k} satisfying SubOpt𝖲𝖢(xk)ϵ\operatorname{SubOpt}_{\mathsf{SC}}(x_{k})\leq\epsilon.

4.4 Non-convex case

For the hard instance we present in (20), the definition of the suboptimality measure becomes

SubOpt𝖭𝖢(x):=Lfx𝒫𝒳(x12Lff0(x)),\operatorname{SubOpt}_{\mathsf{NC}}(x)\mathrel{\mathop{:}}=L_{f}\left\|x-\mathcal{P}_{\mathcal{X}}\left(x-\frac{1}{2L_{f}}\nabla f_{0}(x)\right)\right\|,

where 𝒳\mathcal{X} refers to the feasible region of problem (20), i.e., 𝒳:={x2Ndx[1]=x[2]==x[2N]}\mathcal{X}\mathrel{\mathop{:}}=\{x\in\mathbb{R}^{2Nd}\mid x[1]=x[2]=\cdots=x[2N]\}. For given positive integer dd, we define the function G0(,):d×dG_{0}(\cdot,\cdot):\mathbb{R}^{d}\times\mathbb{R}^{d}\mapsto\mathbb{R} as

G0(u,v)=Ψ(1)Φ(u1)+i=2d[Ψ(vi1)Φ(ui)Ψ(vi1)Φ(ui)],G_{0}(u,v)=-\Psi(1)\Phi(u_{1})+\sum_{i=2}^{d}\left[\Psi(-v_{i-1})\Phi(-u_{i})-\Psi(v_{i-1})\Phi(u_{i})\right], (27)

where function Φ(x)\Phi(x) and Ψ(x)\Psi(x) are defined as

Ψ(x):={0x1/2exp(11(2x1)2)x>1/2andΦ(x)=exe12t2dt.\Psi(x)\mathrel{\mathop{:}}=\begin{cases}0&x\leq 1/2\\ \exp\left(1-\frac{1}{(2x-1)^{2}}\right)&x>1/2\end{cases}\quad\text{and}\quad\Phi(x)=\sqrt{e}\int_{-\infty}^{x}e^{-\frac{1}{2}t^{2}}\mathrm{d}t.

The function G(,)G(\cdot,\cdot) in formulation (20) is a scaled version of G0(,)G_{0}(\cdot,\cdot) and its formal definition will be given later. Let us discuss G0(,)G_{0}(\cdot,\cdot) first. Define g0(v):=G0(v,v)g_{0}(v)\mathrel{\mathop{:}}=G_{0}(v,v) and one can observe g0(v)g_{0}(v) coincides with the hard instance constructed in carmon2020lower . We give some useful properties of G0(u,v)G_{0}(u,v).

Lemma 7

The function G0(u,v)G_{0}(u,v) has the following properties.

  1.

    If vd=0v_{d}=0, then g0(v)1\|\nabla g_{0}(v)\|\geq 1.

  2.

    G0(0,0)infu,vG0(u,v)12dG_{0}(0,0)-\inf_{u,v}G_{0}(u,v)\leq 12d.

  3.

    G0(u,v)G_{0}(u,v) is l0l_{0}-smooth with l0l_{0} being a universal constant that is independent of the problem dimension and other constants.

Proof

Part 1 comes from carmon2020lower .

Part 2: Note that 0Ψ(x)e0\leq\Psi(x)\leq e and 0Φ(x)2πe0\leq\Phi(x)\leq\sqrt{2\pi e}. It holds that

G0(u,v)Ψ(1)Φ(u1)i=2dΨ(vi1)Φ(ui)de2πe12d.G_{0}(u,v)\geq-\Psi(1)\Phi(u_{1})-\sum_{i=2}^{d}\Psi(v_{i-1})\Phi(u_{i})\geq-de\sqrt{2\pi e}\geq-12d.

Combining the fact that G0(0,0)0G_{0}(0,0)\leq 0 yields the property.

To prove Part 3, fix u,v,p,qdu,v,p,q\in\mathbb{R}^{d} with (p,q)=1\|(p,q)\|=1 and define the function h():h(\cdot):\mathbb{R}\mapsto\mathbb{R} as the restriction of G0(u,v)G_{0}(u,v) along the direction (p,q)(p,q), i.e., h(θ):=G0(u+θp,v+θq)h(\theta)\mathrel{\mathop{:}}=G_{0}(u+\theta p,v+\theta q). Taking v0=1v_{0}=1 and q0=0q_{0}=0, we have

h′′(0)=i=1d2G0(u,v)ui2pi2+i=1d12G0(u,v)vi2qi2+2i=2d2G0(u,v)uivi1piqi1.h^{\prime\prime}(0)=\sum_{i=1}^{d}\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}^{2}}p_{i}^{2}+\sum_{i=1}^{d-1}\frac{\partial^{2}G_{0}(u,v)}{\partial v_{i}^{2}}q_{i}^{2}+2\sum_{i=2}^{d}\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}\partial v_{i-1}}p_{i}q_{i-1}.

By simple derivations, we can obtain 0<Ψ(x)<e,0Ψ(x)54/e,|Ψ′′(x)|850<\Psi(x)<e,0\leq\Psi^{\prime}(x)\leq\sqrt{54/e},|\Psi^{\prime\prime}(x)|\leq 8^{5} and 0<Φ(x)<2πe,0<Φ(x)e,supx|Φ′′(x)|270<\Phi(x)<\sqrt{2\pi e},0<\Phi^{\prime}(x)\leq\sqrt{e},\sup_{x}|\Phi^{\prime\prime}(x)|\leq 27. It follows

|2G0(u,v)ui2|\displaystyle\left|\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}^{2}}\right| =|Ψ(vi1)Φ′′(ui)Ψ(vi1)Φ′′(ui)|54e,1id,\displaystyle=\left|\Psi(-v_{i-1})\Phi^{\prime\prime}(-u_{i})-\Psi(v_{i-1})\Phi^{\prime\prime}(u_{i})\right|\leq 54e,\quad 1\leq i\leq d,
|2G0(u,v)vi2|\displaystyle\left|\frac{\partial^{2}G_{0}(u,v)}{\partial v_{i}^{2}}\right| =|Ψ′′(vi)Φ(ui+1)Ψ′′(vi)Φ(ui+1)|22πe×85,1id1,\displaystyle=\left|\Psi^{\prime\prime}(-v_{i})\Phi(-u_{i+1})-\Psi^{\prime\prime}(v_{i})\Phi(u_{i+1})\right|\leq 2\sqrt{2\pi e}\times 8^{5},\quad 1\leq i\leq d-1,
|2G0(u,v)uivi1|\displaystyle\left|\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}\partial v_{i-1}}\right| =|Ψ(vi1)Φ(ui)Ψ(vi1)Φ(ui)|254,2id.\displaystyle=\left|\Psi^{\prime}(-v_{i-1})\Phi^{\prime}(-u_{i})-\Psi^{\prime}(v_{i-1})\Phi^{\prime}(u_{i})\right|\leq 2\sqrt{54},\quad 2\leq i\leq d.

Therefore, it holds

|h′′(0)|\displaystyle|h^{\prime\prime}(0)| max{|2G0(u,v)ui2|,|2G0(u,v)vi2|,2|2G0(u,v)uivi1|}i=1d(pi2+qi2+piqi1)\displaystyle\leq\max\left\{\left|\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}^{2}}\right|,\left|\frac{\partial^{2}G_{0}(u,v)}{\partial v_{i}^{2}}\right|,2\left|\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}\partial v_{i-1}}\right|\right\}\sum_{i=1}^{d}\left(p_{i}^{2}+q_{i}^{2}+p_{i}q_{i-1}\right)
22πe×85i=1d(pi2+qi2+piqi1).\displaystyle\leq 2\sqrt{2\pi e}\times 8^{5}\cdot\sum_{i=1}^{d}\left(p_{i}^{2}+q_{i}^{2}+p_{i}q_{i-1}\right).

Since i=1d(pi2+qi2)=1\sum_{i=1}^{d}(p_{i}^{2}+q_{i}^{2})=1 and i=1dpiqi1pq1\sum_{i=1}^{d}p_{i}q_{i-1}\leq\|p\|\|q\|\leq 1, we obtain

|h′′(0)|600000,|h^{\prime\prime}(0)|\leq 600000,

which completes the proof. ∎
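The scalar bounds on Ψ, Φ and their derivatives used in this proof can be checked numerically on a grid. The sketch below forms the derivatives analytically (for x>1/2, Ψ′(x)=4Ψ(x)/(2x−1)³ and Ψ″(x)=Ψ(x)(16/(2x−1)⁶−24/(2x−1)⁴); Φ′(x)=√e·e^{−x²/2}) and confirms the constants stated above; the grid range and tolerances are arbitrary choices of this sketch.

```python
import numpy as np
from math import erf, sqrt, pi, e

xs = np.linspace(-10.0, 10.0, 200001)
mask = xs > 0.5
s = np.where(mask, 2 * xs - 1, 1.0)              # safe shifted variable for Psi

Psi = np.where(mask, np.exp(1 - 1 / s**2), 0.0)
dPsi = np.where(mask, 4 * Psi / s**3, 0.0)                     # Psi'
d2Psi = np.where(mask, Psi * (16 / s**6 - 24 / s**4), 0.0)     # Psi''

Phi = np.array([sqrt(e) * sqrt(pi / 2) * (1 + erf(x / sqrt(2))) for x in xs])
dPhi = sqrt(e) * np.exp(-xs**2 / 2)                            # Phi'
d2Phi = -sqrt(e) * xs * np.exp(-xs**2 / 2)                     # Phi''

print(Psi.max() <= e, dPsi.max() <= sqrt(54 / e) + 1e-9, np.abs(d2Psi).max() <= 8**5)
print(Phi.max() <= sqrt(2 * pi * e) + 1e-9, dPhi.max() <= sqrt(e), np.abs(d2Phi).max() <= 27)
```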

For given positive parameters c0,Lf,α>0c_{0},L_{f},\alpha>0, we construct the following scaled function based on G0(u,v)G_{0}(u,v),

G(u,v)=Lfα2l0G0(uα,vα),G(u,v)=\frac{L_{f}\alpha^{2}}{l_{0}}G_{0}\left(\frac{u}{\alpha},\frac{v}{\alpha}\right), (28)

where α\alpha can be adjusted to fulfill the condition f0(0)infxf0(x)Δf_{0}(0)-\inf_{x}f_{0}(x)\leq\Delta. Similarly, we define g(v):=G(v,v)g(v)\mathrel{\mathop{:}}=G(v,v). By simple derivations, we can generalize Lemma 7 to the case of G(u,v)G(u,v).

Corollary 7

The function G(u,v)G(u,v) has the following properties.

  1.

    G(0,0)infu,vG(u,v)12dl01Lfα2G(0,0)-\inf_{u,v}G(u,v)\leq 12dl_{0}^{-1}L_{f}\alpha^{2}.

  2.

    If vd=0v_{d}=0, then g(v)l01Lfα\|\nabla g(v)\|\geq l_{0}^{-1}L_{f}\alpha.

  3.

    G(u,v)G(u,v) is LfL_{f}-smooth.

Next, we give a key lemma that relates SubOpt𝖭𝖢\operatorname{SubOpt}_{\mathsf{NC}} and g\|\nabla g\|.

Lemma 8

For any x2Ndx\in\mathbb{R}^{2Nd}, let x¯=12Ni=12Nx[i]\bar{x}=\frac{1}{2N}\sum_{i=1}^{2N}x[i]. Suppose that G(,)G(\cdot,\cdot) is LfL_{f}-smooth, then we have

\operatorname{SubOpt}_{\mathsf{NC}}(x)\geq\frac{\sqrt{N}}{4}\|\nabla g(\bar{x})\|.
Proof

Denote 𝒢¯(x):=12Ni=12Nx[i]f0(x)\bar{\mathcal{G}}(x)\mathrel{\mathop{:}}=\frac{1}{2N}\sum_{i=1}^{2N}\nabla_{x[i]}f_{0}(x). On the one hand, it holds that

𝒫𝒳(x12Lff0(x))=(x¯12Lf𝒢¯(x),x¯12Lf𝒢¯(x),,x¯12Lf𝒢¯(x)).\mathcal{P}_{\mathcal{X}}\left(x-\frac{1}{2L_{f}}\nabla f_{0}(x)\right)=\left(\bar{x}-\frac{1}{2L_{f}}\bar{\mathcal{G}}(x),\bar{x}-\frac{1}{2L_{f}}\bar{\mathcal{G}}(x),\cdots,\bar{x}-\frac{1}{2L_{f}}\bar{\mathcal{G}}(x)\right).

Therefore, we have

x𝒫𝒳(x12Lff0(x))2=i=12Nx[i]x¯+12Lf𝒢¯(x)2=i=12Nx[i]x¯2+N2Lf2𝒢¯(x)2,\begin{split}\left\|x-\mathcal{P}_{\mathcal{X}}\left(x-\frac{1}{2L_{f}}\nabla f_{0}(x)\right)\right\|^{2}&=\sum_{i=1}^{2N}\left\|x[i]-\bar{x}+\frac{1}{2L_{f}}\bar{\mathcal{G}}(x)\right\|^{2}\\ &=\sum_{i=1}^{2N}\|x[i]-\bar{x}\|^{2}+\frac{N}{2L_{f}^{2}}\|\bar{\mathcal{G}}(x)\|^{2},\end{split} (29)

where the last equality holds because i=12N(x[i]x¯)=0\sum_{i=1}^{2N}(x[i]-\bar{x})=0. On the other hand, recall that g(x)=G(x,x)g(x)=G(x,x), by chain rule we have

Ng(x¯)=NuG(x¯,x¯)+vG(x¯,x¯)i=1NuG(x¯,x¯)uG(x[i],x[i+N])+i=1NvG(x¯,x¯)vG(x[i],x[i+N])+i=1N(uG(x[i],x[i+N])+vG(x[i],x[i+N])).\begin{split}N\|\nabla g(\bar{x})\|=&N\left\|\nabla_{u}G(\bar{x},\bar{x})+\nabla_{v}G(\bar{x},\bar{x})\right\|\\ \leq&\sum_{i=1}^{N}\left\|\nabla_{u}G(\bar{x},\bar{x})-\nabla_{u}G(x[i],x[i+N])\right\|\\ &+\sum_{i=1}^{N}\left\|\nabla_{v}G(\bar{x},\bar{x})-\nabla_{v}G(x[i],x[i+N])\right\|\\ &+\left\|\sum_{i=1}^{N}\left(\nabla_{u}G(x[i],x[i+N])+\nabla_{v}G(x[i],x[i+N])\right)\right\|.\end{split} (30)

For the first term on the right hand side, we have

uG(x¯,x¯)uG(x[i],x[i+N])\displaystyle\left\|\nabla_{u}G(\bar{x},\bar{x})-\nabla_{u}G(x[i],x[i+N])\right\|
\displaystyle\leq uG(x¯,x¯)uG(x[i],x¯)+uG(x[i],x¯)uG(x[i],x[i+N])\displaystyle~{}\left\|\nabla_{u}G(\bar{x},\bar{x})-\nabla_{u}G(x[i],\bar{x})\right\|+\left\|\nabla_{u}G(x[i],\bar{x})-\nabla_{u}G(x[i],x[i+N])\right\|
\displaystyle\leq Lf(x¯x[i]+x¯x[i+N]),\displaystyle~{}L_{f}\left(\|\bar{x}-x[i]\|+\|\bar{x}-x[i+N]\|\right),

where the second inequality is due to the LfL_{f}-smoothness of G(,)G(\cdot,\cdot). Similar derivation also applies to the second term on the right hand side of (30). Therefore, it holds

Ng(x¯)\displaystyle N\|\nabla g(\bar{x})\| 2Lfi=12Nx¯x[i]+2N𝒢¯(x)\displaystyle\leq 2L_{f}\sum_{i=1}^{2N}\|\bar{x}-x[i]\|+2N\|\bar{\mathcal{G}}(x)\|
(i)2Lf2Ni=12Nx¯x[i]2+2N𝒢¯(x)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}2L_{f}\sqrt{2N\sum_{i=1}^{2N}\|\bar{x}-x[i]\|^{2}}+2N\|\bar{\mathcal{G}}(x)\|
(ii)8Lf2N+8Lf2Ni=12Nx[i]x¯2+N2Lf2𝒢¯(x)2\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}\sqrt{8L_{f}^{2}N+8L_{f}^{2}N}\sqrt{\sum_{i=1}^{2N}\|x[i]-\bar{x}\|^{2}+\frac{N}{2L_{f}^{2}}\|\bar{\mathcal{G}}(x)\|^{2}}
(iii)4NSubOpt𝖭𝖢(x),\displaystyle\stackrel{{\scriptstyle(iii)}}{{\leq}}4\sqrt{N}\cdot\operatorname{SubOpt}_{\mathsf{NC}}(x),

where (i)(i) and (ii)(ii) come from Cauchy–Schwarz inequality, (iii)(iii) utilizes equality (29). ∎

We can now state a complexity lower bound for finding an approximate solution of non-convex problems with first-order linear span algorithms.

Theorem 4.3

Let parameters Lf,Δ>0,κA1L_{f},\Delta>0,\kappa_{A}\geq 1 be given. For any integer kκAk\geq\kappa_{A}, there exists an instance in NC(Lf,κA,Δ)\mathcal{F}_{\mathrm{NC}}(L_{f},\kappa_{A},\Delta) of form (20), with component function GG defined in (28), N=κA+12,K=k2N+1+1,d=K+2N=\left\lfloor\frac{\kappa_{A}+1}{2}\right\rfloor,K=\left\lfloor\frac{k-2}{N+1}\right\rfloor+1,d=K+2 and α=l0Δ12NdLf\alpha=\sqrt{\frac{l_{0}\Delta}{12NdL_{f}}}. For this problem, suppose the initial point is x0=0x_{0}=0; then, for a universal constant c0>0c_{0}>0, the iterates generated by any first-order algorithm in 𝒜\mathcal{A} satisfy

SubOpt𝖭𝖢(xk)c0κALfΔk.\operatorname{SubOpt}_{\mathsf{NC}}(x_{k})\geq c_{0}\cdot\sqrt{\frac{\kappa_{A}L_{f}\Delta}{k}}.
Proof

By property (3) in Corollary 7, we know f0(x)f_{0}(x) is LfL_{f}-smooth. According to the definition of α\alpha, it holds F(0)infxF(x)f0(0)infxf0(x)12Ndl01Lfα2ΔF(0)-\inf_{x}F(x)\leq f_{0}(0)-\inf_{x}f_{0}(x)\leq 12Ndl_{0}^{-1}L_{f}\alpha^{2}\leq\Delta. As proved in Theorem 4.2, the condition number of AA is not greater than κA\kappa_{A}. Accordingly, the constructed instance belongs to the class NC(Lf,κA,Δ)\mathcal{F}_{\mathrm{NC}}(L_{f},\kappa_{A},\Delta). Since K=d2K=d-2, the last element of all 2N2N vectors is zero, that is, (x[i])d=0,1i2N(x[i])_{d}=0,\forall 1\leq i\leq 2N. This implies x¯d=0\bar{x}_{d}=0. Utilizing Lemma 8 and property (2) in Corollary 7, we obtain

SubOpt𝖭𝖢(x)N4g(x¯)N4l01Lfα=14LfΔ12l0d.\operatorname{SubOpt}_{\mathsf{NC}}(x)\geq\frac{\sqrt{N}}{4}\|\nabla g(\bar{x})\|\geq\frac{\sqrt{N}}{4}l_{0}^{-1}L_{f}\alpha=\frac{1}{4}\sqrt{\frac{L_{f}\Delta}{12l_{0}d}}.

By definition, we have d=K+25kκAd=K+2\leq\frac{5k}{\kappa_{A}}, and this gives the desired result. ∎

Corollary 8

For any first-order algorithm in 𝒜\mathcal{A} and 0<ϵc0LfΔ0<\epsilon\leq c_{0}\sqrt{L_{f}\Delta} where c0c_{0} is defined in Theorem 4.3, there exists a problem instance in NC(Lf,κA,Δ)\mathcal{F}_{\mathrm{NC}}(L_{f},\kappa_{A},\Delta) such that at least

Ω(κALfΔϵ2)\Omega\left(\frac{\kappa_{A}L_{f}\Delta}{\epsilon^{2}}\right)

iterations are required in order to find an iterate xkx_{k} satisfying SubOpt𝖭𝖢(xk)ϵ\operatorname{SubOpt}_{\mathsf{NC}}(x_{k})\leq\epsilon.

4.5 Convex case

For the linear equality constrained problem, the definition of suboptimality becomes

SubOpt𝖢(x)=f(x)minx:Ax=bf(x)+ρAxb.\displaystyle\operatorname{SubOpt}_{\mathsf{C}}(x)=f(x)-\min_{x:Ax=b}f(x)+\rho\|Ax-b\|.

Recall that for ρ2y\rho\geq 2\left\|y^{\star}\right\|, a bound on SubOpt𝖢(x)\operatorname{SubOpt}_{\mathsf{C}}(x) implies a corresponding bound on the standard optimality measure max{|f(x)f(x)|,Axb}\max\{\left|f(x)-f(x^{\star})\right|,\left\|Ax-b\right\|\}.

Note that in Section 4.3, we consider the strongly convex problem class (Lf,μf,κA)\mathcal{F}(L_{f},\mu_{f},\kappa_{A}) with μf>0\mu_{f}>0. In this section, we demonstrate how the complexity lower bound presented in Theorem 4.2 can be adapted to the convex problem class with μf=0\mu_{f}=0.

Theorem 4.4

Let parameters Lf,D>0,κA1L_{f},D>0,\kappa_{A}\geq 1 be given. For any integer kκAk\geq\kappa_{A}, there exists an instance in C(Lf,κA,D)\mathcal{F}_{\mathrm{C}}(L_{f},\kappa_{A},D) of form (20), with component function GG defined in (23), N=κA+12,K=k2N+1+1,d=2K,μf=Lf(K+1)2N=\left\lfloor\frac{\kappa_{A}+1}{2}\right\rfloor,K=\left\lfloor\frac{k-2}{N+1}\right\rfloor+1,d=2K,\mu_{f}=\frac{L_{f}}{(K+1)^{2}} and α\alpha be suitably chosen. For this problem, the kk-th iterate xkx_{k} generated by any first-order algorithm in 𝒜\mathcal{A} satisfies

f(xk)f(x)λ,Axkbc0κA2LfD2k2,f(x_{k})-f(x^{\star})-\left\langle{\lambda^{\star}},{Ax_{k}-b}\right\rangle\geq c_{0}\cdot\frac{\kappa_{A}^{2}L_{f}D^{2}}{k^{2}},

where λ\lambda^{\star} is the optimal dual variable and c0c_{0} is a universal constant.

Proof

We can suitably choose α\alpha so that x0x=D\|x_{0}-x^{\star}\|=D. Then, under our choice of parameters, we have κf=Lf/μf=(K+1)2\kappa_{f}=L_{f}/\mu_{f}=(K+1)^{2}. Note that qexp(2κf1)q\geq\exp\left(-\frac{2}{\sqrt{\kappa_{f}}-1}\right), and hence q2Ke4q^{2K}\geq e^{-4}. Therefore, using the fact that ff is μf\mu_{f}-strongly convex and ATλ=f(x)A^{\mathrm{T}}\lambda^{\star}=\nabla f(x^{\star}), we have

f(xk)f(x)λ,Axkb=\displaystyle f(x_{k})-f(x^{\star})-\left\langle{\lambda^{\star}},{Ax_{k}-b}\right\rangle= f(xk)f(x)f(x),xkx\displaystyle~{}f(x_{k})-f(x^{\star})-\left\langle{\nabla f(x^{\star})},{x_{k}-x^{\star}}\right\rangle
\displaystyle\geq μf2xkx2μfq2K8x0x2\displaystyle~{}\frac{\mu_{f}}{2}\left\|x_{k}-x^{\star}\right\|^{2}\geq\frac{\mu_{f}q^{2K}}{8}\left\|x_{0}-x^{\star}\right\|^{2}
\displaystyle\geq e48LfD2(K+1)2,\displaystyle~{}\frac{e^{-4}}{8}\cdot\frac{L_{f}D^{2}}{(K+1)^{2}},

where the last two inequalities are due to (26). According to the choice of KK and NN, we have K+14kκAK+1\leq\frac{4k}{\kappa_{A}}, which completes the proof. ∎

Corollary 9

For any first-order algorithm in 𝒜\mathcal{A} and any 0<ϵc0LfD20<\epsilon\leq c_{0}L_{f}D^{2} where c0c_{0} is defined in Theorem 4.4, there exists a problem instance in C(Lf,κA,D)\mathcal{F}_{\mathrm{C}}(L_{f},\kappa_{A},D) such that at least

Ω(κALfDϵ)\Omega\left(\frac{\kappa_{A}\sqrt{L_{f}}D}{\sqrt{\epsilon}}\right)

iterations are required in order to find an iterate xkx_{k} satisfying SubOpt𝖢(xk)ϵ\operatorname{SubOpt}_{\mathsf{C}}(x_{k})\leq\epsilon with ρλ\rho\geq\left\|\lambda^{\star}\right\|.

5 Linear equality constrained problem

In this section, we focus on a special case of (1) when h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot), corresponding to the linear equality constrained problem

minxf(x),s.t.Ax=b.\min_{x}\ f(x),\quad\,\textrm{s.t.}\,\ Ax=b. (31)

In Section 5.1, it will be demonstrated that the requirement of full row rank for the matrix AA can be removed. In Section 5.2, we will show that the optimal iteration complexity can be achieved by directly applying APPA on the convex problem.

5.1 Refined result of linear equality constrained problem

First, we consider the strongly convex case. Similar to the derivation in Section 3.1, problem (31) is equivalent to

minxmaxλ(x,λ):=f(x)+λT(Axb).\min_{x}\max_{\lambda}\ \mathcal{L}(x,\lambda)\mathrel{\mathop{:}}=f(x)+\lambda^{\mathrm{T}}(Ax-b).

Due to the strong duality, we have minxmaxλ(x,λ)=maxλminx(x,λ)\min_{x}\max_{\lambda}\ \mathcal{L}(x,\lambda)=\max_{\lambda}\min_{x}\ \mathcal{L}(x,\lambda). Accordingly, the corresponding dual problem can be written as

maxλΦ(λ):=bTλf(ATλ).\max_{\lambda}\ \Phi(\lambda)\mathrel{\mathop{:}}=-b^{\mathrm{T}}\lambda-f^{\star}(-A^{\mathrm{T}}\lambda). (32)

Without the assumption that AA has full row rank, the proof of Lemma 3 no longer holds and Φ(λ)\Phi(\lambda) is not necessarily strongly concave. Instead, we have the following property.

Lemma 9

Let σ¯min\underline{\sigma}_{\min} be the minimal nonzero singular value of AA. It holds that Φ(λ)\Phi(\lambda) is (σ¯min2/Lf)(\underline{\sigma}_{\min}^{2}/L_{f})-strongly concave on the subspace (A)\mathcal{R}(A).

Proof

Noting that f(x)f(x) is LfL_{f}-smooth and μf\mu_{f}-strongly convex, it is easy to derive that f(x)f^{\star}(x) is 1μf\frac{1}{\mu_{f}}-smooth and 1Lf\frac{1}{L_{f}}-strongly convex. Then for any λ,λ(A)\lambda,\lambda^{\prime}\in\mathcal{R}(A), we have

Φ(λ)Φ(λ)\displaystyle\Phi(\lambda^{\prime})-\Phi(\lambda) =b,λλf(ATλ)+f(ATλ)\displaystyle=-\langle b,\lambda^{\prime}-\lambda\rangle-f^{\star}(-A^{\mathrm{T}}\lambda^{\prime})+f^{\star}(-A^{\mathrm{T}}\lambda)
b,λλ+f(ATλ),ATλATλATλATλ22Lf\displaystyle\leq-\langle b,\lambda^{\prime}-\lambda\rangle+\langle\nabla f^{\star}(-A^{\mathrm{T}}\lambda),A^{\mathrm{T}}\lambda^{\prime}-A^{\mathrm{T}}\lambda\rangle-\frac{\|A^{\mathrm{T}}\lambda-A^{\mathrm{T}}\lambda^{\prime}\|^{2}}{2L_{f}}
=Φ(λ),λλ12LfATλATλ2\displaystyle=\langle\nabla\Phi(\lambda),\lambda^{\prime}-\lambda\rangle-\frac{1}{2L_{f}}\|A^{\mathrm{T}}\lambda-A^{\mathrm{T}}\lambda^{\prime}\|^{2}
Φ(λ),λλσ¯min22Lfλλ2,\displaystyle\leq\langle\nabla\Phi(\lambda),\lambda^{\prime}-\lambda\rangle-\frac{\underline{\sigma}_{\min}^{2}}{2L_{f}}\|\lambda-\lambda^{\prime}\|^{2},

where the last inequality holds because λ,λ(A)\lambda,\lambda^{\prime}\in\mathcal{R}(A). ∎

According to the update rule of Algorithm 2, if we set λ0=0\lambda_{0}=0, then we have

λk+1Span{Ax0b,Ax1b,,Axkb},k0.\lambda_{k+1}\in\mathrm{Span}\left\{Ax_{0}-b,Ax_{1}-b,\cdots,Ax_{k}-b\right\},k\geq 0.

Due to the feasibility of the constraint Ax=bAx=b, we have b(A)b\in\mathcal{R}(A). Hence it holds that the iterate sequence {λk}\{\lambda_{k}\} always stays in (A)\mathcal{R}(A). Combining Lemma 9, we can treat Φ(λ)\Phi(\lambda) as a strongly concave function during the iterations. Then the derivation in Section 3.1 still holds by replacing μA\mu_{A} with σ¯min\underline{\sigma}_{\min} and replacing κA\kappa_{A} with κ¯A:=LAσ¯min\underline{\kappa}_{A}\mathrel{\mathop{:}}=\frac{L_{A}}{\underline{\sigma}_{\min}}. Consequently, the derived complexity upper bound is O~(κ¯Aκflog(1/ϵ))\tilde{O}(\underline{\kappa}_{A}\sqrt{\kappa_{f}}\log(1/\epsilon)).
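The quantities in this argument are easy to compute explicitly. The following minimal sketch, assuming a randomly generated rank-deficient AA and an illustrative smoothness constant, computes σ̲_min via the SVD, the restricted strong concavity modulus σ̲_min²/Lf of Lemma 9, and checks the inequality ‖Aᵀw‖ ≥ σ̲_min‖w‖ for a vector w in the range of AA.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 8, 12, 5                               # rank r < m: A is not full row rank
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
L_f = 4.0                                        # illustrative smoothness constant

sv = np.linalg.svd(A, compute_uv=False)
sigma_min_nz = sv[sv > 1e-10 * sv[0]].min()      # minimal nonzero singular value
print("restricted strong concavity modulus:", sigma_min_nz**2 / L_f)

w = A @ rng.standard_normal(n)                   # an arbitrary vector in R(A)
print(np.linalg.norm(A.T @ w) >= sigma_min_nz * np.linalg.norm(w) - 1e-10)
```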

For the convex problem, Algorithm 2 is utilized to solve the strongly convex auxiliary problem (19), hence the upper bound can also be generalized to O~(κ¯ALfDϵ)\tilde{O}\left(\frac{\underline{\kappa}_{A}\sqrt{L_{f}}D}{\sqrt{\epsilon}}\right). For the nonconvex problem, Algorithm 2 is utilized to solve the subproblem (16). Therefore, the complexity upper bound can be similarly generalized to O~(κ¯ALfΔϵ2)\tilde{O}\left(\frac{\underline{\kappa}_{A}L_{f}\Delta}{\epsilon^{2}}\right).

5.2 Direct acceleration for convex problem

Recall that in Section 3.4, we derived the upper bound for the convex problem by constructing a strongly convex auxiliary problem. In this section, we propose an optimal algorithm, presented in Algorithm 4, that solves the original convex problem directly. The algorithm performs an inexact accelerated proximal point method on f(x)f(x) while keeping the constraint intact in the outer loop. In the inner loop, it uses Algorithm 2 to iteratively solve the subproblem until it meets the first-order suboptimality conditions.

1 Input: initial point x0x_{0}, smoothness parameter LfL_{f}, condition number κA\kappa_{A}, subproblem error tolerances {ϵk}\{\epsilon_{k}\} and {γk}\{\gamma_{k}\}, the maximum iteration number TT.
2 Initialize: y1x0y_{1}\leftarrow x_{0}, t11t_{1}\leftarrow 1.
3 for k=1Tk=1\cdots T do
4       Apply Algorithm 2 to find a pair of suboptimal primal and dual variable (xk,λk)(x_{k},\lambda_{k}) of problem
minxf(x)+Lf2xyk2,s.t.Ax=b,\min_{x}f(x)+\frac{L_{f}}{2}\|x-y_{k}\|^{2},\quad\,\textrm{s.t.}\,Ax=b, (33)
such that f(xk)+Lf(xkyk)+ATλkLf2ϵktk\|\nabla f(x_{k})+L_{f}(x_{k}-y_{k})+A^{\mathrm{T}}\lambda_{k}\|\leq\sqrt{\frac{L_{f}}{2}}\cdot\frac{\epsilon_{k}}{t_{k}} and Axkbγk\|Ax_{k}-b\|\leq\gamma_{k}.
5       Compute tk+1=1+1+4tk22t_{k+1}=\frac{1+\sqrt{1+4t_{k}^{2}}}{2}.
6       Compute yk+1=xk+(tk1tk+1)(xkxk1)y_{k+1}=x_{k}+\left(\frac{t_{k}-1}{t_{k+1}}\right)\left(x_{k}-x_{k-1}\right).
Output: xTx_{T}.
Algorithm 4 Inexact APPA for convex problems
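The outer loop of Algorithm 4 can be sketched as follows. Here `solve_subproblem` is a stand-in for Algorithm 2 applied to (33); for illustration only it performs an exact KKT solve for a convex quadratic f, which trivially meets the inner tolerances (δ_k=0 and γ_k at the level of the linear solver), so the sequences {ϵ_k}, {γ_k} are omitted. All data below are synthetic assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 30, 10
Q = rng.standard_normal((n, n)); Q = Q.T @ Q / n            # convex quadratic f(x)=0.5 x'Qx - c'x
c = rng.standard_normal(n)
A = rng.standard_normal((m, n)); b = A @ rng.standard_normal(n)
L_f = np.linalg.eigvalsh(Q).max()

def solve_subproblem(y):
    """Exact stand-in for Algorithm 2 on (33): KKT system of the proximal subproblem."""
    K = np.block([[Q + L_f * np.eye(n), A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([c + L_f * y, b])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]                                 # (x_k, lambda_k)

def inexact_appa(x0, T=50):
    x_prev, y, t = x0.copy(), x0.copy(), 1.0                # y_1 = x_0, t_1 = 1
    for _ in range(T):
        x, lam = solve_subproblem(y)
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2            # t_{k+1} update
        y = x + ((t - 1) / t_next) * (x - x_prev)           # extrapolation step
        x_prev, t = x, t_next
    return x, lam

x, lam = inexact_appa(np.zeros(n))
print("feasibility:", np.linalg.norm(A @ x - b))
print("stationarity:", np.linalg.norm(Q @ x - c + A.T @ lam))
```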

The convergence proof is inspired by beck2009fast and jiang2012inexact . Different from several works that study the inexact accelerated PPA for the unconstrained convex problem, we have an additional linear equality constraint here. Since projection onto the linear constraint is not allowed, we need to incorporate the violation of the constraint into the analysis, which makes the convergence proof quite different. For simplicity, we define the following notation:

f(xk)+Lf(xkyk)+ATλk=δk,Axkb=ζk,\nabla f(x_{k})+L_{f}(x_{k}-y_{k})+A^{\mathrm{T}}\lambda_{k}=\delta_{k},~{}~{}Ax_{k}-b=\zeta_{k},
vk=f(xk)f(x)+λ,Axkb0,uk=tkxk(tk1)xk1x,v_{k}=f(x_{k})-f(x^{\star})+\langle\lambda^{\star},Ax_{k}-b\rangle\geq 0,~{}~{}u_{k}=t_{k}x_{k}-\left(t_{k}-1\right)x_{k-1}-x^{\star},
ak=tk2vk0,bk=Lf2uk2,τ=Lf2x0x2,ek=tkδk,uk,a_{k}=t_{k}^{2}v_{k}\geq 0,~{}~{}b_{k}=\frac{L_{f}}{2}\|u_{k}\|^{2},~{}~{}\tau=\frac{L_{f}}{2}\|x_{0}-x^{\star}\|^{2},~{}~{}e_{k}=t_{k}\left\langle\delta_{k},u_{k}\right\rangle,
ωk=xkx,ξk=|tkλkλ,Auk|,ϵ¯k=j=1kϵj,ξ¯k=j=1k(ξj+ϵj2).\omega_{k}=\|x_{k}-x^{\star}\|,~{}~{}\xi_{k}=|t_{k}\langle\lambda_{k}-\lambda^{\star},Au_{k}\rangle|,~{}~{}\bar{\epsilon}_{k}=\sum_{j=1}^{k}\epsilon_{j},~{}~{}\bar{\xi}_{k}=\sum_{j=1}^{k}\left(\xi_{j}+\epsilon_{j}^{2}\right).

Due to the KKT conditions, we know Ax=bAx^{\star}=b and f(x)+ATλ=0\nabla f(x^{\star})+A^{\mathrm{T}}\lambda^{\star}=0. It follows that vk=f(xk)f(x)f(x),xkxv_{k}=f(x_{k})-f(x^{\star})-\langle\nabla f(x^{\star}),x_{k}-x^{\star}\rangle, which corresponds to the Bregman divergence associated with ff between the points xkx_{k} and xx^{\star}. In other words, vkv_{k} measures the distance between xkx_{k} and xx^{\star} under the Bregman divergence and can serve as a suboptimality measure. In the following analysis, we first give an upper bound of vkv_{k} (or aka_{k}) and eventually derive an upper bound for the objective function gap and the constraint violation.

According to the update rule tk+1=1+1+4tk22t_{k+1}=\frac{1+\sqrt{1+4t_{k}^{2}}}{2} and t1=1t_{1}=1, it is easy to obtain the following lemma, which will be frequently used in the following proofs.

Lemma 10

The sequence {tk}\{t_{k}\} generated by Algorithm 4 satisfies k+12tkk\frac{k+1}{2}\leq t_{k}\leq k for any k1k\geq 1.

In the following lemma, we give the one-step estimation of ak+bka_{k}+b_{k}.

Lemma 11

For any k1k\geq 1, it holds

ak+1+bk+1ak+bk+ξk+1+ek+1.a_{k+1}+b_{k+1}\leq a_{k}+b_{k}+\xi_{k+1}+e_{k+1}. (34)
Proof

For each jj and any xx, we have

f(x)f(xj)xxj,f(xj)=xxj,δjLf(xjyj)ATλjLf2xjyj2+Lfxjyj,yjx+xxj,δjATλj.\begin{split}f(x)-f(x_{j})&\geq\langle x-x_{j},\nabla f(x_{j})\rangle\\ &=\langle x-x_{j},\delta_{j}-L_{f}(x_{j}-y_{j})-A^{\mathrm{T}}\lambda_{j}\rangle\\ &\geq\frac{L_{f}}{2}\|x_{j}-y_{j}\|^{2}+L_{f}\langle x_{j}-y_{j},y_{j}-x\rangle\\ &\quad+\langle x-x_{j},\delta_{j}-A^{\mathrm{T}}\lambda_{j}\rangle.\end{split} (35)

Note that vkvk+1=f(xk)f(xk+1)+λ,AxkAxk+1v_{k}-v_{k+1}=f(x_{k})-f(x_{k+1})+\langle\lambda^{\star},Ax_{k}-Ax_{k+1}\rangle. We apply (35) with j=k+1j=k+1 and x=xkx=x_{k} to get

vkvk+1Lf2xk+1yk+12+Lfxk+1yk+1,yk+1xk+ζkζk+1,λλk+1+xkxk+1,δk+1,\begin{split}v_{k}-v_{k+1}&\geq\frac{L_{f}}{2}\|x_{k+1}-y_{k+1}\|^{2}+L_{f}\langle x_{k+1}-y_{k+1},y_{k+1}-x_{k}\rangle\\ &\quad+\langle\zeta_{k}-\zeta_{k+1},\lambda^{\star}-\lambda_{k+1}\rangle+\langle x_{k}-x_{k+1},\delta_{k+1}\rangle,\end{split} (36)

where we utilize the fact that AxkAxk+1=ζkζk+1Ax_{k}-Ax_{k+1}=\zeta_{k}-\zeta_{k+1}. Similarly, let j=k+1j=k+1 and x=xx=x^{\star} in (35), we obtain

vk+1=f(x)f(xk+1)λ,Axk+1bLf2xk+1yk+12+Lfxk+1yk+1,yk+1xAxAxk+1,λk+1λ,Axk+1b+xxk+1,δk+1=Lf2xk+1yk+12+Lfxk+1yk+1,yk+1xζk+1,λλk+1+xxk+1,δk+1,\begin{split}-v_{k+1}&=f(x^{\star})-f(x_{k+1})-\langle\lambda^{\star},Ax_{k+1}-b\rangle\\ &\geq\frac{L_{f}}{2}\|x_{k+1}-y_{k+1}\|^{2}+L_{f}\langle x_{k+1}-y_{k+1},y_{k+1}-x^{\star}\rangle\\ &\quad-\langle Ax^{\star}-Ax_{k+1},\lambda_{k+1}\rangle-\langle\lambda^{\star},Ax_{k+1}-b\rangle+\langle x^{\star}-x_{k+1},\delta_{k+1}\rangle\\ &=\frac{L_{f}}{2}\|x_{k+1}-y_{k+1}\|^{2}+L_{f}\langle x_{k+1}-y_{k+1},y_{k+1}-x^{\star}\rangle\\ &\quad-\langle\zeta_{k+1},\lambda^{\star}-\lambda_{k+1}\rangle+\langle x^{\star}-x_{k+1},\delta_{k+1}\rangle,\end{split} (37)

Multiply (36) by (tk+11)(t_{k+1}-1) and add it to (37). Then, multiplying both sides of the resulting inequality by tk+1t_{k+1} and using the relation tk2=tk+12tk+1t_{k}^{2}=t_{k+1}^{2}-t_{k+1}, we have

tk2vktk+12vk+1\displaystyle t_{k}^{2}v_{k}-t_{k+1}^{2}v_{k+1} Lf2tk+1(xk+1yk+1)2\displaystyle\geq\frac{L_{f}}{2}\|t_{k+1}(x_{k+1}-y_{k+1})\|^{2}
+Lftk+1xk+1yk+1,tk+1yk+1(tk+11)xkx\displaystyle\quad+L_{f}t_{k+1}\langle x_{k+1}-y_{k+1},t_{k+1}y_{k+1}-(t_{k+1}-1)x_{k}-x^{\star}\rangle
+λλk+1,tk2ζktk+12ζk+1tk+1δk+1,uk+1.\displaystyle\quad+\langle\lambda^{\star}-\lambda_{k+1},t_{k}^{2}\zeta_{k}-t_{k+1}^{2}\zeta_{k+1}\rangle-t_{k+1}\langle\delta_{k+1},u_{k+1}\rangle.

For the first two terms on the right hand side of the above inequality, we use the usual Pythagoras relation ba2+2ba,ac=bc2ac2\|{b}-{a}\|^{2}+2\langle{b}-{a},{a}-{c}\rangle=\|{b}-{c}\|^{2}-\|{a}-{c}\|^{2}, then substitute tk+1yk+1=tk+1xk+(tk1)(xkxk1)t_{k+1}y_{k+1}=t_{k+1}x_{k}+\left(t_{k}-1\right)\left(x_{k}-x_{k-1}\right) into it. After rearranging, we obtain

tk+12vk+1+Lf2uk+12tk2vk+Lf2uk2λλk+1,tk2ζktk+12ζk+1+ek+1.t_{k+1}^{2}v_{k+1}+\frac{L_{f}}{2}\|u_{k+1}\|^{2}\leq t_{k}^{2}v_{k}+\frac{L_{f}}{2}\|u_{k}\|^{2}-\langle\lambda^{\star}-\lambda_{k+1},t_{k}^{2}\zeta_{k}-t_{k+1}^{2}\zeta_{k+1}\rangle+e_{k+1}.

Combining the fact tk+1Auk+1=tk+12ζk+1tk2ζkt_{k+1}Au_{k+1}=t_{k+1}^{2}\zeta_{k+1}-t_{k}^{2}\zeta_{k} yields the conclusion. ∎

Now, having the inequality (34), we can derive an upper bound of aka_{k}.

Lemma 12

For any k1k\geq 1, it holds

ak(τ+ϵ¯k)2+2ξ¯k.a_{k}\leq\left(\sqrt{\tau}+\bar{\epsilon}_{k}\right)^{2}+2\bar{\xi}_{k}. (38)
Proof

By applying (35) with x=xx=x^{\star} and j=1j=1, we have

f(x)f(x1)\displaystyle f(x^{\star})-f(x_{1}) Lf2x1y12+Lfx1y1,y1x\displaystyle\geq\frac{L_{f}}{2}\|x_{1}-y_{1}\|^{2}+L_{f}\langle x_{1}-y_{1},y_{1}-x^{\star}\rangle
AxAx1,λ1+xx1,δ1\displaystyle\quad-\langle Ax^{\star}-Ax_{1},\lambda_{1}\rangle+\langle x^{\star}-x_{1},\delta_{1}\rangle
=Lf2x1x2Lf2y1x2\displaystyle=\frac{L_{f}}{2}\|x_{1}-x^{\star}\|^{2}-\frac{L_{f}}{2}\|y_{1}-x^{\star}\|^{2}
AxAx1,λ1+xx1,δ1,\displaystyle\quad-\langle Ax^{\star}-Ax_{1},\lambda_{1}\rangle+\langle x^{\star}-x_{1},\delta_{1}\rangle,

where we utilize the usual Pythagoras relation ba2+2ba,ac=bc2ac2\|{b}-{a}\|^{2}+2\langle{b}-{a},{a}-{c}\rangle=\|{b}-{c}\|^{2}-\|{a}-{c}\|^{2}. Noting that y1=x0y_{1}=x_{0}, t1=1t_{1}=1 and u1=x1xu_{1}=x_{1}-x^{\star}, we have

t12v1+Lf2u12Lf2x0x2+t1λλ1,Au1+e1.t_{1}^{2}v_{1}+\frac{L_{f}}{2}\|u_{1}\|^{2}\leq\frac{L_{f}}{2}\|x_{0}-x^{\star}\|^{2}+t_{1}\langle\lambda^{\star}-\lambda_{1},Au_{1}\rangle+e_{1}.

Since ek(tk2/Lfδk)(Lf/2uk)ϵkbke_{k}\leq(t_{k}\sqrt{2/L_{f}}\|\delta_{k}\|)(\sqrt{L_{f}/2}\|u_{k}\|)\leq\epsilon_{k}\sqrt{b_{k}}, the above inequality implies

a1+b1τ+ξ1+ϵ1b1.a_{1}+b_{1}\leq\tau+\xi_{1}+\epsilon_{1}\sqrt{b_{1}}. (39)

Let sk=i=1kξi+i=1kϵibis_{k}=\sum_{i=1}^{k}\xi_{i}+\sum_{i=1}^{k}\epsilon_{i}\sqrt{b_{i}}. Utilizing (34) repeatedly and combining with (39), we obtain

ak+bkτ+sk.a_{k}+b_{k}\leq\tau+s_{k}. (40)

Since ak0a_{k}\geq 0, we have sk=sk1+ϵkbk+ξksk1+ϵkτ+sk+ξks_{k}=s_{k-1}+\epsilon_{k}\sqrt{b_{k}}+\xi_{k}\leq s_{k-1}+\epsilon_{k}\sqrt{\tau+s_{k}}+\xi_{k}, which implies

τ+sk12(ϵk+ϵk2+4(τ+sk1+ξk)).\sqrt{\tau+s_{k}}\leq\frac{1}{2}\left(\epsilon_{k}+\sqrt{\epsilon_{k}^{2}+4\left(\tau+s_{k-1}+\xi_{k}\right)}\right).

By some simple derivations, we get

sksk1+12ϵk2+ξk+12ϵkϵk2+4(τ+sk1+ξk)sk1+12ϵk2+ξk+12ϵk(ϵk+2τ+2sk1+ξk)sk1+ϵk2+ξk+ϵk(τ+sk1+ξk).\begin{split}s_{k}&\leq s_{k-1}+\frac{1}{2}\epsilon_{k}^{2}+\xi_{k}+\frac{1}{2}\epsilon_{k}\sqrt{\epsilon_{k}^{2}+4\left(\tau+s_{k-1}+\xi_{k}\right)}\\ &\leq s_{k-1}+\frac{1}{2}\epsilon_{k}^{2}+\xi_{k}+\frac{1}{2}\epsilon_{k}\left(\epsilon_{k}+2\sqrt{\tau}+2\sqrt{s_{k-1}+\xi_{k}}\right)\\ &\leq s_{k-1}+\epsilon_{k}^{2}+\xi_{k}+\epsilon_{k}\left(\sqrt{\tau}+\sqrt{s_{k-1}+\xi_{k}}\right).\end{split} (41)

From the inequality (39), we know that τb1ϵ1b1ξ1\tau\geq b_{1}-\epsilon_{1}\sqrt{b_{1}}-\xi_{1}, which implies that b112(ϵ1+ϵ12+4(τ+ξ1))ϵ1+τ+ξ1\sqrt{b_{1}}\leq\frac{1}{2}\left(\epsilon_{1}+\sqrt{\epsilon_{1}^{2}+4\left(\tau+\xi_{1}\right)}\right)\leq\epsilon_{1}+\sqrt{\tau+\xi_{1}}. Accordingly,

s1=ϵ1b1+ξ1ϵ1(ϵ1+τ+ξ1)+ξ1ϵ12+ξ1+ϵ1(τ+ξ1).s_{1}=\epsilon_{1}\sqrt{b_{1}}+\xi_{1}\leq\epsilon_{1}\left(\epsilon_{1}+\sqrt{\tau+\xi_{1}}\right)+\xi_{1}\leq\epsilon_{1}^{2}+\xi_{1}+\epsilon_{1}\left(\sqrt{\tau}+\sqrt{\xi_{1}}\right). (42)

Applying (41) repeatedly and combining (42) yields

sk\displaystyle s_{k} s1+j=2kϵj2+j=2kξj+τj=2kϵj+j=2kϵjsj1+ξj\displaystyle\leq s_{1}+\sum_{j=2}^{k}\epsilon_{j}^{2}+\sum_{j=2}^{k}\xi_{j}+\sqrt{\tau}\sum_{j=2}^{k}\epsilon_{j}+\sum_{j=2}^{k}\epsilon_{j}\sqrt{s_{j-1}+\xi_{j}}
j=1kϵj2+j=1kξj+τj=1kϵj+j=1kϵjsj\displaystyle\leq\sum_{j=1}^{k}\epsilon_{j}^{2}+\sum_{j=1}^{k}\xi_{j}+\sqrt{\tau}\sum_{j=1}^{k}\epsilon_{j}+\sum_{j=1}^{k}\epsilon_{j}\sqrt{s_{j}}
ξ¯k+τϵ¯k+skϵ¯k,\displaystyle\leq\bar{\xi}_{k}+\sqrt{\tau}\bar{\epsilon}_{k}+\sqrt{s_{k}}\bar{\epsilon}_{k},

where the second inequality holds because sj1+ξjsjs_{j-1}+\xi_{j}\leq s_{j} and ξ1s1\xi_{1}\leq s_{1} by the definition of sks_{k}. The above inequality implies

sk12(ϵ¯k+(ϵ¯k2+4ξ¯k+4ϵ¯kτ)1/2),\sqrt{s_{k}}\leq\frac{1}{2}\left(\bar{\epsilon}_{k}+\left(\bar{\epsilon}_{k}^{2}+4\bar{\xi}_{k}+4\bar{\epsilon}_{k}\sqrt{\tau}\right)^{1/2}\right), (43)

and hence skϵ¯k2+2ξ¯k+2ϵ¯kτs_{k}\leq\bar{\epsilon}_{k}^{2}+2\bar{\xi}_{k}+2\bar{\epsilon}_{k}\sqrt{\tau}. Substituting it into (40), we obtain

akτ+ϵ¯k2+2ξ¯k+2ϵ¯kτ(τ+ϵ¯k)2+2ξ¯k,a_{k}\leq\tau+\bar{\epsilon}_{k}^{2}+2\bar{\xi}_{k}+2\bar{\epsilon}_{k}\sqrt{\tau}\leq\left(\sqrt{\tau}+\bar{\epsilon}_{k}\right)^{2}+2\bar{\xi}_{k},

which completes the proof. ∎

Note that the right hand side of (38) depends on ξk\xi_{k}, which has not yet been bounded. We estimate this term in the following lemma.

Lemma 13

Assume that γk1\gamma_{k}\leq 1, δk1\|\delta_{k}\|\leq 1 and tk+12γk+1tk2γkt_{k+1}^{2}\gamma_{k+1}\leq t_{k}^{2}\gamma_{k} for any k1k\geq 1. Then we have

ξ1\displaystyle\xi_{1} 1σ¯min2(LfLAω1+Lf(LAD+1)+LA)t12γ1,\displaystyle\leq\frac{1}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\omega_{1}+L_{f}(L_{A}D+1)+L_{A})t_{1}^{2}\gamma_{1}, (44)
ξk\displaystyle\xi_{k} 2σ¯min2(LfLAωk+4Lf+LA)tk12γk1,k2.\displaystyle\leq\frac{2}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\omega_{k}+4L_{f}+L_{A})t_{k-1}^{2}\gamma_{k-1},\quad\forall k\geq 2. (45)
Proof

By the update rule yk+1=xk+(tk1tk+1)(xkxk1)y_{k+1}=x_{k}+\left(\frac{t_{k}-1}{t_{k+1}}\right)\left(x_{k}-x_{k-1}\right), it holds

Axk+1Ayk+1=ζk+1(1+tk1tk+1)ζk+tk1tk+1ζk1.\displaystyle Ax_{k+1}-Ay_{k+1}=\zeta_{k+1}-\left(1+\frac{t_{k}-1}{t_{k+1}}\right)\zeta_{k}+\frac{t_{k}-1}{t_{k+1}}\zeta_{k-1}.

Thus, we have Axk+1Ayk+14\|Ax_{k+1}-Ay_{k+1}\|\leq 4 due to the fact that γk1\gamma_{k}\leq 1 and 1tktk+11\leq t_{k}\leq t_{k+1}. Combining the KKT condition f(x)+ATλ=0\nabla f(x^{\star})+A^{\mathrm{T}}\lambda^{\star}=0 with the definition of δk+1\delta_{k+1}, we obtain

AT(λλk+1)=f(xk+1)f(x)+Lf(xk+1yk+1)δk+1.A^{\mathrm{T}}(\lambda^{\star}-\lambda_{k+1})=\nabla f(x_{k+1})-\nabla f(x^{\star})+L_{f}(x_{k+1}-y_{k+1})-\delta_{k+1}.

Accordingly, it follows

AAT(λλk+1)=A(f(xk+1)f(x))+Lf(Axk+1Ayk+1)Aδk+1LfLAxk+1x+4Lf+LA,\begin{split}&\|AA^{\mathrm{T}}(\lambda^{\star}-\lambda_{k+1})\|\\ =&~{}\|A(\nabla f(x_{k+1})-\nabla f(x^{\star}))+L_{f}(Ax_{k+1}-Ay_{k+1})-A\delta_{k+1}\|\\ \leq&~{}{L_{f}L_{A}\|x_{k+1}-x^{\star}\|}+4L_{f}+L_{A},\end{split} (46)

where the inequality uses the smoothness of ff and δk1\|\delta_{k}\|\leq 1. Since λλk+1(A)\lambda^{\star}-\lambda_{k+1}\in\mathcal{R}(A), the above inequality actually implies

λλk+11σ¯min2(LfLAxk+1x+4Lf+LA).\|\lambda^{\star}-\lambda_{k+1}\|\leq\frac{1}{\underline{\sigma}_{\min}^{2}}\left({L_{f}L_{A}\|x_{k+1}-x^{\star}\|}+4L_{f}+L_{A}\right).

It follows that for any k1k\geq 1,

ξk+1=|λλk+1,tk+1Auk+1|1σ¯min2(LfLAxk+1x+4Lf+LA)tk2ζktk+12ζk+12σ¯min2(LfLAxk+1x+4Lf+LA)tk2γk,\begin{split}\xi_{k+1}&=|\langle\lambda^{\star}-\lambda_{k+1},t_{k+1}Au_{k+1}\rangle|\\ &\leq\frac{1}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\|x_{k+1}-x^{\star}\|+4L_{f}+L_{A})\|t_{k}^{2}\zeta_{k}-t_{k+1}^{2}\zeta_{k+1}\|\\ &\leq\frac{2}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\|x_{k+1}-x^{\star}\|+4L_{f}+L_{A})t_{k}^{2}\gamma_{k},\end{split} (47)

where the last inequality uses tk+12γk+1tk2γkt_{k+1}^{2}\gamma_{k+1}\leq t_{k}^{2}\gamma_{k}.

Similar to the derivations in (46) and (47), we can also obtain

AAT(λλ1)\displaystyle\|AA^{\mathrm{T}}(\lambda^{\star}-\lambda_{1})\|
=\displaystyle= A(f(x1)f(x))+Lf(Ax1Ay1)Aδ1\displaystyle~{}\|A(\nabla f(x_{1})-\nabla f(x^{\star}))+L_{f}(Ax_{1}-Ay_{1})-A\delta_{1}\|
=\displaystyle= A(f(x1)f(x))+Lf(Ax1b)Lf(Ax0Ax)Aδ1\displaystyle~{}\|A(\nabla f(x_{1})-\nabla f(x^{\star}))+L_{f}(Ax_{1}-b)-L_{f}(Ax_{0}-Ax^{\star})-A\delta_{1}\|
\displaystyle\leq LfLAω1+Lfγ1+LfLAD+LAδ1\displaystyle~{}L_{f}L_{A}\omega_{1}+L_{f}\gamma_{1}+L_{f}L_{A}D+L_{A}\|\delta_{1}\|
\displaystyle\leq LfLAω1+Lf+LfLAD+LA.\displaystyle~{}L_{f}L_{A}\omega_{1}+L_{f}+L_{f}L_{A}D+L_{A}.

Noting that u1=x1xu_{1}=x_{1}-x^{\star}, we have Au1=Ax1bAu_{1}=Ax_{1}-b and

ξ1t12γ1λ1λ1σ¯min2(LfLAω1+Lf(LAD+1)+LA)t12γ1,\xi_{1}\leq t_{1}^{2}\gamma_{1}\|\lambda_{1}-\lambda^{\star}\|\leq\frac{1}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\omega_{1}+L_{f}(L_{A}D+1)+L_{A})t_{1}^{2}\gamma_{1},

which completes the proof. ∎

According to Lemma 12 and Lemma 13, it remains to estimate ωk\omega_{k} in order to obtain the upper bound of aka_{k}. In the discussion that follows, we prove that ωk\omega_{k} can be bounded by a linearly increasing sequence. Utilizing this property, we give the requirements on ϵk\epsilon_{k} and γk\gamma_{k}, and obtain the following result.

Theorem 5.1

Suppose that f(x)f(x) is convex, x0xD\|x_{0}-x^{\star}\|\leq D, the matrix AA satisfies ALA\left\|A\right\|\leq L_{A} and the minimum nonzero singular value of AA is no smaller than σ¯min\underline{\sigma}_{\min}. In Algorithm 4, we set the subproblem error tolerances as ϵkmin{LfD22k2,2Lf}\epsilon_{k}\leq\min\left\{\frac{\sqrt{L_{f}}D}{2\sqrt{2}k^{2}},\sqrt{\frac{2}{L_{f}}}\right\} and γkmin{σ¯min2LfD28(LfLA(ϖ+D)+4Lf+LA),1}1tk2(k+1)3\gamma_{k}\leq\min\left\{\frac{\underline{\sigma}_{\min}^{2}L_{f}D^{2}}{8(L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A})},1\right\}\frac{1}{t_{k}^{2}(k+1)^{3}} where ϖ\varpi is a constant defined in (49). Let {xk}\{x_{k}\} be the iterate sequence generated by Algorithm 4. It holds that

vk16LfD2(k+1)2.v_{k}\leq\frac{16L_{f}D^{2}}{(k+1)^{2}}.
Proof

We can check that the assumptions in Lemma 13 are satisfied: γk1\gamma_{k}\leq 1, δk1tkLf2ϵk1\|\delta_{k}\|\leq\frac{1}{t_{k}}\sqrt{\frac{L_{f}}{2}}\epsilon_{k}\leq 1 and tk+12γk+1tk2γkt_{k+1}^{2}\gamma_{k+1}\leq t_{k}^{2}\gamma_{k} for any k1k\geq 1. We also have ϵ¯k=i=1kϵiLf2D\bar{\epsilon}_{k}=\sum_{i=1}^{k}\epsilon_{i}\leq\sqrt{\frac{L_{f}}{2}}D and i=1kϵi2LfD24\sum_{i=1}^{k}\epsilon_{i}^{2}\leq\frac{L_{f}D^{2}}{4}.

First, we discuss the upper bound of ωk\omega_{k}. By definition,

uk=tk(xkx)(tk1)(xk1x)tkωk(tk1)ωk1,\|u_{k}\|=\|t_{k}(x_{k}-x^{\star})-(t_{k}-1)(x_{k-1}-x^{\star})\|\geq t_{k}\omega_{k}-(t_{k}-1)\omega_{k-1},

where the last inequality holds because tk1t_{k}\geq 1. By (40), it holds bk=Lf2ukτ+sk\sqrt{b_{k}}=\sqrt{\frac{L_{f}}{2}}\|u_{k}\|\leq\sqrt{\tau}+\sqrt{s_{k}}. It follows

tkωk(tk1)ωk12Lf(τ+sk),t_{k}\omega_{k}-(t_{k}-1)\omega_{k-1}\leq\sqrt{\frac{2}{L_{f}}}(\sqrt{\tau}+\sqrt{s_{k}}),

which implies

ωkωk1+2Lf(τ+sk).\omega_{k}\leq\omega_{k-1}+\sqrt{\frac{2}{L_{f}}}(\sqrt{\tau}+\sqrt{s_{k}}).

Substituting (43) into the above inequality and combining (45), we get

ωk\displaystyle\omega_{k} ωk1+2τLf+12Lf(ϵ¯k+(ϵ¯k2+4ξ¯k1+4ξk+4ϵk2+4ϵ¯kτ)1/2)\displaystyle\leq\omega_{k-1}+\sqrt{\frac{2\tau}{L_{f}}}+\sqrt{\frac{1}{2{L_{f}}}}\left(\bar{\epsilon}_{k}+\left(\bar{\epsilon}_{k}^{2}+4\bar{\xi}_{k-1}+4\xi_{k}+4\epsilon_{k}^{2}+4\bar{\epsilon}_{k}\sqrt{\tau}\right)^{1/2}\right)
ωk1+2τLf+12Lfϵ¯k+12Lf(ϵ¯k2+4ξ¯k1\displaystyle\leq\omega_{k-1}+\sqrt{\frac{2\tau}{L_{f}}}+\sqrt{\frac{1}{2{L_{f}}}}\bar{\epsilon}_{k}+\sqrt{\frac{1}{2{L_{f}}}}\left(\bar{\epsilon}_{k}^{2}+4\bar{\xi}_{k-1}\right.
+8LfLAtk12γk1σ¯min2ωk+8(4Lf+LA)tk12γk1σ¯min2+4ϵk2+4ϵ¯kτ)1/2.\displaystyle\left.\qquad\qquad+\frac{8L_{f}L_{A}t_{k-1}^{2}\gamma_{k-1}}{\underline{\sigma}_{\min}^{2}}\omega_{k}+\frac{8(4L_{f}+L_{A})t_{k-1}^{2}\gamma_{k-1}}{\underline{\sigma}_{\min}^{2}}+4\epsilon_{k}^{2}+4\bar{\epsilon}_{k}\sqrt{\tau}\right)^{1/2}.

For simplicity of notation, we denote C0:=2τLf+12Lfϵ¯kC_{0}\mathrel{\mathop{:}}=\sqrt{\frac{2\tau}{L_{f}}}+\sqrt{\frac{1}{2{L_{f}}}}\bar{\epsilon}_{k}, C1:=4LAtk12γk1σ¯min2C_{1}\mathrel{\mathop{:}}=\frac{4L_{A}t_{k-1}^{2}\gamma_{k-1}}{\underline{\sigma}_{\min}^{2}}, and
C2:=12Lf(ϵ¯k2+4ξ¯k1+8(4Lf+LA)tk12γk1σ¯min2+4ϵk2+4ϵ¯kτ)C_{2}\mathrel{\mathop{:}}=\frac{1}{2L_{f}}\left(\bar{\epsilon}_{k}^{2}+4\bar{\xi}_{k-1}+\frac{8(4L_{f}+L_{A})t_{k-1}^{2}\gamma_{k-1}}{\underline{\sigma}_{\min}^{2}}+4\epsilon_{k}^{2}+4\bar{\epsilon}_{k}\sqrt{\tau}\right), so that the above inequality can be rewritten as

ωkωk1+C0+C1(C2C1+ωk)1/2,\omega_{k}\leq\omega_{k-1}+C_{0}+\sqrt{C_{1}}\left(\frac{C_{2}}{C_{1}}+\omega_{k}\right)^{1/2},

which, viewed as a quadratic inequality in (C2/C1+ωk)1/2\left(C_{2}/C_{1}+\omega_{k}\right)^{1/2}, implies

(C2C1+ωk)1/212(C1+C1+4ωk1+4C0+4C2/C1).\left(\frac{C_{2}}{C_{1}}+\omega_{k}\right)^{1/2}\leq\frac{1}{2}\left(\sqrt{C_{1}}+\sqrt{C_{1}+4\omega_{k-1}+4C_{0}+4C_{2}/C_{1}}\right).

Squaring both sides, subtracting C2/C1C_{2}/C_{1}, and then bounding the remaining square root term by term, we have

ωkC12+ωk1+C0+12C12+4C1ωk1+4C0C1+4C2ωk1+C1+C0+C0C1+C1ωk1+C2.\begin{split}\omega_{k}&\leq\frac{C_{1}}{2}+\omega_{k-1}+C_{0}+\frac{1}{2}\sqrt{C_{1}^{2}+4C_{1}\omega_{k-1}+4C_{0}C_{1}+4C_{2}}\\ &\leq\omega_{k-1}+C_{1}+C_{0}+\sqrt{C_{0}C_{1}}+\sqrt{C_{1}\omega_{k-1}}+\sqrt{C_{2}}.\end{split} (48)
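The second inequality in (48) uses the elementary bound \sqrt{a+b}\leq\sqrt{a}+\sqrt{b} applied term by term under the square root,

\sqrt{C_{1}^{2}+4C_{1}\omega_{k-1}+4C_{0}C_{1}+4C_{2}}\leq C_{1}+2\sqrt{C_{1}\omega_{k-1}}+2\sqrt{C_{0}C_{1}}+2\sqrt{C_{2}},

after which the two C1/2C_{1}/2 contributions combine into C1C_{1}.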

On the one hand, we observe that the upper bound of ωk\omega_{k} depends on C0C_{0} and C2C_{2}, which further depend on ξ¯k1\bar{\xi}_{k-1}. On the other hand, from (45), we know ξk\xi_{k} can be bounded by ωk\omega_{k}. Observing this relationship, we can establish upper bounds on both ωk\omega_{k} and ξk\xi_{k} by induction. It is easy to derive that C0C¯0:=3D2C_{0}\leq\bar{C}_{0}\mathrel{\mathop{:}}=\frac{3D}{2}, C1C¯1:=4LAσ¯min2C_{1}\leq\bar{C}_{1}\mathrel{\mathop{:}}=\frac{4L_{A}}{\underline{\sigma}_{\min}^{2}} and C212Lf(3LfD2+4ξ¯k1+8(4Lf+LA)σ¯min2)C_{2}\leq\frac{1}{2L_{f}}\left(3L_{f}D^{2}+4\bar{\xi}_{k-1}+\frac{8(4L_{f}+L_{A})}{\underline{\sigma}_{\min}^{2}}\right). Let C¯2:=12Lf(6LfD2+8(4Lf+LA)σ¯min2)\bar{C}_{2}\mathrel{\mathop{:}}=\frac{1}{2L_{f}}\left(6L_{f}D^{2}+\frac{8(4L_{f}+L_{A})}{\underline{\sigma}_{\min}^{2}}\right), D0=C¯0+C¯1+C¯2+C¯0C¯1,D1=4LAσ¯min2D_{0}=\bar{C}_{0}+\bar{C}_{1}+\sqrt{\bar{C}_{2}}+\sqrt{\bar{C}_{0}\bar{C}_{1}},D_{1}=\frac{4L_{A}}{\underline{\sigma}_{\min}^{2}}. Let ω¯1\bar{\omega}_{1} be the largest root of (51) taken with equality, and define

ϖ=max{(D1+D1+4D0)24,ω¯1}.\varpi=\max\left\{\frac{\left(\sqrt{D_{1}}+\sqrt{D_{1}+4D_{0}}\right)^{2}}{4},\bar{\omega}_{1}\right\}. (49)

Note that C¯0,C¯1,C¯2,D0,D1,ω¯1\bar{C}_{0},\bar{C}_{1},\bar{C}_{2},D_{0},D_{1},\bar{\omega}_{1} and ϖ\varpi depend only on the constants LA,σ¯min,LfL_{A},\underline{\sigma}_{\min},L_{f} and DD. We aim to prove by induction that ωkkϖ\omega_{k}\leq k\varpi and ξkLfD24k2\xi_{k}\leq\frac{L_{f}D^{2}}{4k^{2}} for all k1k\geq 1.

First, we prove the case when k=1k=1. Note that u1=x1x=ω1\|u_{1}\|=\|x_{1}-x^{\star}\|=\omega_{1}. By (40), we have

ω12Lf(τ+s1)\omega_{1}\leq\sqrt{\frac{2}{L_{f}}}(\sqrt{\tau}+\sqrt{s_{1}})

Combining this with (42) yields

ω12Lf(τ+(ϵ12+ξ1+ϵ1(τ+ξ1))1/2).\omega_{1}\leq\sqrt{\frac{2}{L_{f}}}\left(\sqrt{\tau}+\left(\epsilon_{1}^{2}+\xi_{1}+\epsilon_{1}\left(\sqrt{\tau}+\sqrt{\xi_{1}}\right)\right)^{1/2}\right). (50)

Putting (50) and (44) together and using ϵ12Lf\epsilon_{1}\leq\sqrt{\frac{2}{L_{f}}} yields a quartic inequality with respect to ω1\omega_{1}:

ω12Lf[τ+(2Lf+1σ¯min2(LfLAω1+Lf(LAD+1)+LA)+2Lf(τ+1σ¯min2(LfLAω1+Lf(LAD+1)+LA)))1/2].\begin{split}\omega_{1}\leq\sqrt{\frac{2}{L_{f}}}\left[\sqrt{\tau}+\left(\frac{2}{L_{f}}+\frac{1}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\omega_{1}+L_{f}(L_{A}D+1)+L_{A})\right.\right.\\ \left.\left.+\sqrt{\frac{2}{L_{f}}}\left(\sqrt{\tau}+\sqrt{\frac{1}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\omega_{1}+L_{f}(L_{A}D+1)+L_{A})}\right)\right)^{1/2}\right].\end{split} (51)
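Since ω¯1\bar{\omega}_{1} is defined only implicitly as the largest root of (51) taken with equality, the following minimal Python sketch illustrates one way to compute it numerically. The right-hand side of (51) is increasing and concave in ω1\omega_{1}, so the corresponding fixed-point equation has a unique positive root, which bisection locates. The numerical values of LfL_{f}, LAL_{A}, DD and σ¯min\underline{\sigma}_{\min} below are hypothetical placeholders, and τ\tau denotes the constant from the earlier analysis, treated here as a given input.

import math

# Hypothetical placeholder constants; tau is the constant from the earlier analysis.
Lf, LA, D, sigma_min, tau = 1.0, 2.0, 1.0, 0.5, 1.0

def rhs(omega):
    # Right-hand side of (51); increasing and concave in omega.
    q = (Lf * LA * omega + Lf * (LA * D + 1.0) + LA) / sigma_min**2
    inner = 2.0 / Lf + q + math.sqrt(2.0 / Lf) * (math.sqrt(tau) + math.sqrt(q))
    return math.sqrt(2.0 / Lf) * (math.sqrt(tau) + math.sqrt(inner))

# rhs(0) > 0 and rhs grows like sqrt(omega), so rhs(omega) = omega has a unique
# positive root; bracket it from above, then bisect.
lo, hi = 0.0, 1.0
while rhs(hi) > hi:
    hi *= 2.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if rhs(mid) > mid else (lo, mid)
omega_bar_1 = 0.5 * (lo + hi)  # numerical approximation of the largest root of (51)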

Recalling the definitions of ω¯1\bar{\omega}_{1} and ϖ\varpi, we get ω1ω¯1ϖ\omega_{1}\leq\bar{\omega}_{1}\leq\varpi. Thus, it holds that

ξ11σ¯min2(LfLAϖ+Lf(LAD+1)+LA)t12γ1LfD24,\xi_{1}\leq\frac{1}{\underline{\sigma}^{2}_{\min}}(L_{f}L_{A}\varpi+L_{f}(L_{A}D+1)+L_{A})t_{1}^{2}\gamma_{1}\leq\frac{L_{f}D^{2}}{4},

where the last inequality holds because γkσ¯min2LfD28(LfLA(ϖ+D)+4Lf+LA)1tk2(k+1)3\gamma_{k}\leq\frac{\underline{\sigma}_{\min}^{2}L_{f}D^{2}}{8(L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A})}\frac{1}{t_{k}^{2}(k+1)^{3}}.
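In more detail, the choice of γ1\gamma_{1} gives t12γ1σ¯min2LfD2/(64(LfLA(ϖ+D)+4Lf+LA))t_{1}^{2}\gamma_{1}\leq\underline{\sigma}_{\min}^{2}L_{f}D^{2}/\left(64(L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A})\right) since (1+1)3=8(1+1)^{3}=8, and LfLAϖ+Lf(LAD+1)+LALfLA(ϖ+D)+4Lf+LAL_{f}L_{A}\varpi+L_{f}(L_{A}D+1)+L_{A}\leq L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A}, so that

\xi_{1}\leq\frac{L_{f}L_{A}\varpi+L_{f}(L_{A}D+1)+L_{A}}{\underline{\sigma}_{\min}^{2}}\cdot\frac{\underline{\sigma}_{\min}^{2}L_{f}D^{2}}{64\left(L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A}\right)}\leq\frac{L_{f}D^{2}}{64}\leq\frac{L_{f}D^{2}}{4}.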

Suppose that there exists k2k\geq 2 such that ωiiϖ\omega_{i}\leq i\varpi and ξiLfD24i2\xi_{i}\leq\frac{L_{f}D^{2}}{4i^{2}} for any 1ik11\leq i\leq k-1. Then we have i=1k1ξiLfD22\sum_{i=1}^{k-1}\xi_{i}\leq\frac{L_{f}D^{2}}{2} and ξ¯k1=i=1k1ξi+i=1k1ϵi23LfD24\bar{\xi}_{k-1}=\sum_{i=1}^{k-1}\xi_{i}+\sum_{i=1}^{k-1}\epsilon_{i}^{2}\leq\frac{3L_{f}D^{2}}{4}. Hence C2C¯2C_{2}\leq\bar{C}_{2}, and (48) implies

ωk\displaystyle\omega_{k} ωk1+C¯1+C¯0+C¯0C¯1+C1ωk1+C¯2\displaystyle\leq\omega_{k-1}+\bar{C}_{1}+\bar{C}_{0}+\sqrt{\bar{C}_{0}\bar{C}_{1}}+\sqrt{C_{1}\omega_{k-1}}+\sqrt{\bar{C}_{2}}
(k1)ϖ+C¯1+C¯0+C¯0C¯1+(k1)C1ϖ+C¯2\displaystyle\leq(k-1)\varpi+\bar{C}_{1}+\bar{C}_{0}+\sqrt{\bar{C}_{0}\bar{C}_{1}}+\sqrt{(k-1)C_{1}\varpi}+\sqrt{\bar{C}_{2}}
(k1)ϖ+D0+D1ϖ\displaystyle\leq(k-1)\varpi+D_{0}+\sqrt{D_{1}\varpi}
kϖ,\displaystyle\leq k\varpi,

where the third inequality is due to C1ωk14LAσ¯min2tk12γk1(k1)ϖD1ϖC_{1}\omega_{k-1}\leq\frac{4L_{A}}{\underline{\sigma}_{\min}^{2}}t_{k-1}^{2}\gamma_{k-1}\cdot(k-1)\varpi\leq D_{1}\varpi, and the last inequality holds because ϖ\varpi is no smaller than the largest root of the equation D0+D1x=xD_{0}+\sqrt{D_{1}x}=x, so that D0+D1ϖϖD_{0}+\sqrt{D_{1}\varpi}\leq\varpi. Then we can obtain ξkLfD24k2\xi_{k}\leq\frac{L_{f}D^{2}}{4k^{2}} due to (45) and γkσ¯min2LfD28(LfLA(ϖ+D)+4Lf+LA)1tk2(k+1)3\gamma_{k}\leq\frac{\underline{\sigma}_{\min}^{2}L_{f}D^{2}}{8(L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A})}\frac{1}{t_{k}^{2}(k+1)^{3}}. Consequently, by induction we conclude that ωkkϖ\omega_{k}\leq k\varpi and ξkLfD24k2\xi_{k}\leq\frac{L_{f}D^{2}}{4k^{2}} for any k1k\geq 1. It follows that ξ¯kLfD2\bar{\xi}_{k}\leq L_{f}D^{2}. Combining the fact that tk(k+1)/2t_{k}\geq(k+1)/2 with inequality (38) yields

f(xk)f(x)+λ,Axkb(Lf2D+Lf2D)2+2LfD2(k+1)2/416LfD2(k+1)2,f(x_{k})-f(x^{\star})+\langle\lambda^{\star},Ax_{k}-b\rangle\leq\frac{\left(\sqrt{\frac{L_{f}}{2}}D+\sqrt{\frac{L_{f}}{2}}D\right)^{2}+2L_{f}D^{2}}{(k+1)^{2}/4}\leq\frac{16L_{f}D^{2}}{(k+1)^{2}},

which completes the proof. ∎
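To make the parameter choices in Theorem 5.1 concrete, the following minimal Python sketch generates one admissible schedule. It assumes the standard accelerated update t1=1t_{1}=1, tk+1=(1+1+4tk2)/2t_{k+1}=(1+\sqrt{1+4t_{k}^{2}})/2, which satisfies tk(k+1)/2t_{k}\geq(k+1)/2, and treats LfL_{f}, DD, LAL_{A}, σ¯min\underline{\sigma}_{\min} and the constant ϖ\varpi from (49) as given inputs; the numerical values below are hypothetical placeholders.

import math

# Hypothetical placeholder constants; varpi is the constant from (49), assumed precomputed.
Lf, D, LA, sigma_min, varpi = 1.0, 1.0, 2.0, 0.5, 10.0

def schedules(K):
    # Return (t_k, eps_k, gamma_k) for k = 1, ..., K, with eps_k and gamma_k set
    # exactly at the upper bounds allowed by Theorem 5.1.
    c_gamma = min(sigma_min**2 * Lf * D**2 / (8.0 * (Lf * LA * (varpi + D) + 4.0 * Lf + LA)), 1.0)
    t, ts, eps, gam = 1.0, [], [], []
    for k in range(1, K + 1):
        ts.append(t)
        eps.append(min(math.sqrt(Lf) * D / (2.0 * math.sqrt(2.0) * k**2), math.sqrt(2.0 / Lf)))
        gam.append(c_gamma / (t**2 * (k + 1)**3))
        t = (1.0 + math.sqrt(1.0 + 4.0 * t**2)) / 2.0  # assumed accelerated update for t_{k+1}
    return ts, eps, gam

ts, eps, gam = schedules(10)
# Properties used in the proof: t_k >= (k+1)/2 and t_{k+1}^2 gamma_{k+1} <= t_k^2 gamma_k.
assert all(t >= (k + 1) / 2.0 for k, t in enumerate(ts, start=1))
assert all(ts[i + 1]**2 * gam[i + 1] <= ts[i]**2 * gam[i] for i in range(len(ts) - 1))

With γk\gamma_{k} set at its upper bound, tk2γkt_{k}^{2}\gamma_{k} is proportional to 1/(k+1)31/(k+1)^{3} and hence nonincreasing in kk, which is the monotonicity used at the beginning of the proof.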

Corollary 10

Under the same assumptions and the same choices of the parameters ϵk\epsilon_{k} and γk\gamma_{k} as in Theorem 5.1, in order to find an approximate solution xkx_{k} satisfying vk=f(xk)f(x)+λ,Axkbϵv_{k}=f(x_{k})-f(x^{\star})+\langle\lambda^{\star},Ax_{k}-b\rangle\leq\epsilon, the total number of gradient evaluations required by Algorithm 4 is bounded by

O~(κALfD2ϵ).\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right).
Proof

In Theorem 5.1, we have proved that the outer loop complexity is O(LfD2ϵ)O\left(\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right). Let xk=argminAx=b{f(x)+Lf2xyk2}x_{k}^{\star}=\operatorname*{arg\,min}_{Ax=b}\left\{f(x)+\frac{L_{f}}{2}\|x-y_{k}\|^{2}\right\}. The KKT conditions are f(xk)+Lf(xkyk)+ATλk=0\nabla f(x_{k}^{\star})+L_{f}(x_{k}^{\star}-y_{k})+A^{\mathrm{T}}\lambda_{k}^{\star}=0 and Axkb=0Ax_{k}^{\star}-b=0. By the definitions of δk\delta_{k} and ζk\zeta_{k}, we have

δk\displaystyle\delta_{k} =f(xk)f(xk)+Lf(xkxk)+AT(λkλk),\displaystyle=\nabla f(x_{k})-\nabla f(x_{k}^{\star})+L_{f}(x_{k}-x_{k}^{\star})+A^{\mathrm{T}}(\lambda_{k}-\lambda_{k}^{\star}),
ζk\displaystyle\zeta_{k} =A(xkxk).\displaystyle=A(x_{k}-x_{k}^{\star}).

It follows that δk2Lfxkxk+LAλkλk\|\delta_{k}\|\leq 2L_{f}\|x_{k}-x_{k}^{\star}\|+L_{A}\|\lambda_{k}-\lambda_{k}^{\star}\| and ζkLAxkxk\|\zeta_{k}\|\leq L_{A}\|x_{k}-x_{k}^{\star}\|. Let ϵ~k=1LA+2Lfmin{Lf2ϵktk,γk}\tilde{\epsilon}_{k}=\frac{1}{L_{A}+2L_{f}}\cdot\min\left\{\sqrt{\frac{L_{f}}{2}}\cdot\frac{\epsilon_{k}}{t_{k}},\gamma_{k}\right\}. If the subroutine outputs a pair (xk,λk)(x_{k},\lambda_{k}) such that xkxkϵ~k\|x_{k}-x_{k}^{\star}\|\leq\tilde{\epsilon}_{k} and λkλkϵ~k\|\lambda_{k}-\lambda_{k}^{\star}\|\leq\tilde{\epsilon}_{k}, then the required subproblem error tolerances are satisfied. The condition number of the objective function of the subproblem (33) is O(1)O(1). Since ϵ~k\tilde{\epsilon}_{k} depends polynomially on ϵ\epsilon and the other parameters, we obtain by Corollary 1 that the number of gradient evaluations in each inner loop is O~(κA)\tilde{O}(\kappa_{A}). Consequently, the overall complexity is O~(κALfD2ϵ)\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right). ∎
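Concretely, the gradient-evaluation count in the proof above accumulates as follows: with T=O(LfD2/ϵ)T=O\left(\sqrt{L_{f}D^{2}/\epsilon}\right) outer iterations and, assuming (consistently with the use of Corollary 1 above) that each inner loop requires O(κAlog(1/ϵ~k))O\left(\kappa_{A}\log(1/\tilde{\epsilon}_{k})\right) gradient evaluations,

\sum_{k=1}^{T}O\left(\kappa_{A}\log\frac{1}{\tilde{\epsilon}_{k}}\right)=O\left(\kappa_{A}T\log\frac{1}{\epsilon}\right)=\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right),

where the logarithmic dependence on the remaining problem constants is absorbed into the O~\tilde{O} notation.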

Remark 4

Note that Algorithm 4 outputs an approximate solution of the last subproblem, which satisfies AxTbγT\|Ax_{T}-b\|\leq\gamma_{T}. If we further require that γTmin{ϵ2λ,ϵ}\gamma_{T}\leq\min\left\{\frac{\epsilon}{2\|\lambda^{\star}\|},\epsilon\right\}, the overall complexity remains unchanged up to logarithmic factors since the inner loop converges linearly. Therefore, in order to obtain an approximate solution xTx_{T} satisfying f(xT)f(x)+λ,AxTbϵ2f(x_{T})-f(x^{\star})+\langle\lambda^{\star},Ax_{T}-b\rangle\leq\frac{\epsilon}{2} and AxTbγT\|Ax_{T}-b\|\leq\gamma_{T}, the complexity upper bound is still O~(κALfD2ϵ)\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right). In this case, we can conclude that f(xT)f(x)ϵf(x_{T})-f(x^{\star})\leq\epsilon because γTϵ2λ\gamma_{T}\leq\frac{\epsilon}{2\|\lambda^{\star}\|}, as spelled out below. Hence, the above result actually implies a complexity upper bound of O~(κALfD2ϵ)\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right) for finding an approximate solution xTx_{T} satisfying f(xT)f(x)ϵf(x_{T})-f(x^{\star})\leq\epsilon and AxTbϵ\|Ax_{T}-b\|\leq\epsilon.
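Indeed, by the Cauchy–Schwarz inequality,

f(x_{T})-f(x^{\star})=\left[f(x_{T})-f(x^{\star})+\langle\lambda^{\star},Ax_{T}-b\rangle\right]-\langle\lambda^{\star},Ax_{T}-b\rangle\leq\frac{\epsilon}{2}+\|\lambda^{\star}\|\|Ax_{T}-b\|\leq\frac{\epsilon}{2}+\|\lambda^{\star}\|\gamma_{T}\leq\epsilon.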

6 Conclusions

In this work, we analyze the lower and upper complexity bounds for composite optimization problems in the strongly convex, convex, and non-convex scenarios. Different from most previous studies, we specifically consider problem classes with a given condition number κA\kappa_{A}. Our results demonstrate that the presented complexities are optimal up to logarithmic factors. This study provides the first optimal algorithms for the convex and non-convex cases, as well as the first set of lower bounds for all three cases. In future work, it remains interesting to investigate how algorithms can be designed to further tighten the logarithmic factors in the complexity upper bounds.

Funding This work was supported by the National Natural Science Foundation of China under grant number 11831002.

Data Availability

Declarations

Conflict of interest The authors have no relevant financial interest to disclose.

References

  • (1) Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
  • (2) Bertsekas, D.: Dynamic programming and optimal control: Volume I, vol. 1. Athena Scientific (2012)
  • (3) Cai, X., Han, D., Yuan, X.: On the convergence of the direct extension of ADMM for three-block separable convex minimization models with one strongly convex function. Computational Optimization and Applications 66, 39–73 (2017)
  • (4) Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Mathematical Programming 184(1), 71–120 (2020)
  • (5) Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40, 120–145 (2011)
  • (6) Chang, T.H., Hong, M., Liao, W.C., Wang, X.: Asynchronous distributed ADMM for large-scale optimization—part I: Algorithm and convergence analysis. IEEE Transactions on Signal Processing 64(12), 3118–3130 (2016)
  • (7) Feijer, D., Paganini, F.: Stability of primal–dual gradient dynamics and applications to network optimization. Automatica 46(12), 1974–1981 (2010)
  • (8) Hamedani, E.Y., Aybat, N.S.: A primal-dual algorithm with line search for general convex-concave saddle point problems. SIAM Journal on Optimization 31(2), 1299–1329 (2021)
  • (9) Hiriart-Urruty, J., Lemarechal, C.: Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg (1996). URL https://books.google.com.sg/books?id=aSizI0n6tnsC
  • (10) Hong, M., Hajinezhad, D., Zhao, M.M.: Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In: International Conference on Machine Learning, pp. 1529–1538. PMLR (2017)
  • (11) Jiang, K., Sun, D., Toh, K.C.: An inexact accelerated proximal gradient method for large scale linearly constrained convex SDP. SIAM Journal on Optimization 22(3), 1042–1064 (2012)
  • (12) Kong, W., Melo, J.G., Monteiro, R.D.: Complexity of a quadratic penalty accelerated inexact proximal point method for solving linearly constrained nonconvex composite programs. SIAM Journal on Optimization 29(4), 2566–2593 (2019)
  • (13) Kong, W., Melo, J.G., Monteiro, R.D.: Iteration complexity of an inner accelerated inexact proximal augmented Lagrangian method based on the classical Lagrangian function. SIAM Journal on Optimization 33(1), 181–210 (2023)
  • (14) Lin, T., Ma, S., Zhang, S.: On the global linear convergence of the ADMM with multiblock variables. SIAM Journal on Optimization 25(3), 1478–1497 (2015)
  • (15) Lin, T., Ma, S., Zhang, S.: On the sublinear convergence rate of multi-block ADMM. Journal of the Operations Research Society of China 3, 251–274 (2015)
  • (16) Mokhtari, A., Ozdaglar, A.E., Pattathil, S.: Convergence rate of 𝒪(1/k)\mathcal{O}(1/k) for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems. SIAM Journal on Optimization 30(4), 3230–3251 (2020)
  • (17) Nesterov, Y., et al.: Lectures on convex optimization, vol. 137. Springer (2018)
  • (18) Noschese, S., Pasquini, L., Reichel, L.: Tridiagonal Toeplitz matrices: properties and novel applications. Numerical Linear Algebra with Applications 20(2), 302–326 (2013)
  • (19) Ouyang, Y., Chen, Y., Lan, G., Pasiliao Jr, E.: An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences 8(1), 644–681 (2015)
  • (20) Ouyang, Y., Xu, Y.: Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming 185(1-2), 1–35 (2021)
  • (21) Salim, A., Condat, L., Kovalev, D., Richtárik, P.: An optimal algorithm for strongly convex minimization under affine constraints. In: International Conference on Artificial Intelligence and Statistics, pp. 4482–4498. PMLR (2022)
  • (22) Shi, W., Ling, Q., Wu, G., Yin, W.: Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization 25(2), 944–966 (2015)
  • (23) Showalter, R.: Monotone operators in Banach space and nonlinear partial differential equations. Math. Surv. Mono. 49 (1997)
  • (24) Song, C., Jiang, Y., Ma, Y.: Breaking the O(1/ϵ){O}(1/\epsilon) optimal rate for a class of minimax problems. arXiv preprint arXiv:2003.11758 (2020)
  • (25) Sun, H., Hong, M.: Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms. IEEE Transactions on Signal Processing 67(22), 5912–5928 (2019)
  • (26) Xu, Y.: Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization 27(3), 1459–1484 (2017)
  • (27) Xu, Y.: First-order methods for constrained convex programming based on linearized augmented Lagrangian function. INFORMS Journal on Optimization 3(1), 89–117 (2021)
  • (28) Xu, Y.: Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming. Mathematical Programming 185, 199–244 (2021)
  • (29) Xu, Y.: First-order methods for problems with o(1)o(1) functional constraints can have almost the same convergence rate as for unconstrained problems. SIAM Journal on Optimization 32(3), 1759–1790 (2022)
  • (30) Yang, J., Zhang, Y.: Alternating direction algorithms for ℓ1\ell_{1}-problems in compressive sensing. SIAM Journal on Scientific Computing 33(1), 250–278 (2011)
  • (31) Zhang, J., Luo, Z.Q.: A proximal alternating direction method of multiplier for linearly constrained nonconvex minimization. SIAM Journal on Optimization 30(3), 2272–2302 (2020)
  • (32) Zhang, J., Luo, Z.Q.: A global dual error bound and its application to the analysis of linearly constrained nonconvex optimization. SIAM Journal on Optimization 32(3), 2319–2346 (2022)
  • (33) Zhu, M., Chan, T.: An efficient primal-dual hybrid gradient algorithm for total variation image restoration. Ucla Cam Report 34, 8–34 (2008)
  • (34) Zhu, Z., Chen, F., Zhang, J., Wen, Z.: A unified primal-dual algorithm framework for inequality constrained problems. arXiv preprint arXiv:2208.14196 (2022)