
Zhenyuan Zhu: Beijing International Center for Mathematical Research, Peking University, Beijing, China. Email: [email protected]
Fan Chen: School of Mathematical Science, Peking University, Beijing, China. Email: [email protected]
Junyu Zhang (Corresponding Author): Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore. Email: [email protected]
Zaiwen Wen: International Center for Mathematical Research, Center for Machine Learning Research and College of Engineering, Peking University, Beijing, China. Email: [email protected]

On the Optimal Lower and Upper Complexity Bounds for a Class of Composite Optimization Problems

Zhenyuan Zhu    Fan Chen    Junyu Zhang    Zaiwen Wen
Abstract

We study the optimal lower and upper complexity bounds for finding approximate solutions to the composite problem minxf(x)+h(Axb)\min_{x}\ f(x)+h(Ax-b), where ff is smooth and hh is convex. Given access to the proximal operator of hh, for strongly convex, convex, and nonconvex ff, we design efficient first order algorithms with complexities O~(κAκflog(1/ϵ))\tilde{O}\left(\kappa_{A}\sqrt{\kappa_{f}}\log\left(1/{\epsilon}\right)\right), O~(κALfD/ϵ)\tilde{O}\left(\kappa_{A}\sqrt{L_{f}}D/\sqrt{\epsilon}\right), and O~(κALfΔ/ϵ2)\tilde{O}\left(\kappa_{A}L_{f}\Delta/\epsilon^{2}\right), respectively. Here, κA\kappa_{A} is the condition number of the matrix AA in the composition, LfL_{f} is the smoothness constant of ff, and κf\kappa_{f} is the condition number of ff in the strongly convex case. DD is the initial point distance and Δ\Delta is the initial function value gap. Tight lower complexity bounds for the three cases are also derived and they match the upper bounds up to logarithmic factors, thereby demonstrating the optimality of both the upper and lower bounds proposed in this paper.

Keywords:
composite optimization, first order method, upper bound, lower bound
MSC:
90C25, 90C26, 90C46, 90C60

1 Introduction

In this paper, we consider the composite optimization problem:

minxF(x):=f(x)+h(Axb),\min_{x}\ F(x)\mathrel{\mathop{:}}=f(x)+h(Ax-b), (1)

where f:\mathbb{R}^{n}\mapsto\mathbb{R} is a differentiable function whose gradient is L_{f}-Lipschitz continuous, h:\mathbb{R}^{m}\mapsto\mathbb{R}\cup\{+\infty\} is a proper closed convex function whose proximal operator can be efficiently evaluated, and A\in\mathbb{R}^{m\times n} is a matrix with full row rank whose minimal singular value is \mu_{A}. We denote the condition number of the matrix A by \kappa_{A}\mathrel{\mathop{:}}=\frac{\|A\|}{\mu_{A}}. Problem (1) appears in many practical applications, such as distributed optimization shi2015extra ; chang2016asynchronous ; hong2017prox , signal and image processing yang2011alternating ; zhu2008efficient ; chambolle2011first , network optimization feijer2010stability and optimal control bertsekas2012dynamic . In this paper, we aim to establish tight upper and lower complexity bounds for all three cases of strongly convex, convex, and non-convex f.

When h()h(\cdot) is chosen to be the indicator function h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot), the problem becomes the linear equality constrained problem

minxf(x),s.t.Ax=b.\min_{x}f(x),\quad\,\textrm{s.t.}\,Ax=b. (2)

More generally, when h()h(\cdot) is chosen to be h()=𝕀𝒦()h(\cdot)=\mathbb{I}_{\mathcal{K}}(\cdot) where 𝒦\mathcal{K} is a proper cone, the problem becomes the conic inequality constrained problem

minxf(x),s.t.Axb𝒦.\min_{x}\ f(x),\quad\,\textrm{s.t.}\,Ax-b\in\mathcal{K}. (3)

Alternatively, when h()=h(\cdot)=\left\|\cdot\right\|_{*} is a norm function, the problem becomes the norm regularized problem

minxf(x)+Axb.\min_{x}\ f(x)+\|Ax-b\|_{*}. (4)

The motivation of this paper originates from the study of the complexity of non-convex optimization with linear equality constraints (2), which corresponds to the special case of (1) where h(\cdot)=\mathbb{I}_{\{0\}}(\cdot) and f(x) is non-convex. This problem has received considerable attention in the literature. The authors of kong2019complexity combine the proximal point method and the quadratic penalty method, achieving an O(\epsilon^{-3}) complexity for finding an \epsilon-stationary point. Under the assumption that the domain is bounded, the authors of kong2023iteration develop an algorithm that solves the proximal augmented Lagrangian subproblems by the accelerated composite gradient method, resulting in a complexity of \tilde{O}(\epsilon^{-2.5}).

Several recent works have further improved the complexity to O(\epsilon^{-2}), including hong2017prox ; sun2019distributed ; zhang2020proximal ; zhang2022global . The key factor contributing to this improvement is the incorporation of the minimal nonzero singular value of A in the analysis of the complexity bounds. Specifically, the condition number of A is required to be bounded by some constant. In hong2017prox , a proximal primal-dual algorithm is developed, but its dependence on \kappa_{A} is very large. In sun2019distributed , a novel optimal method for distributed optimization is proposed, yet its dependence on Chebyshev acceleration prevents its generalization to problems with h(\cdot)\neq\mathbb{I}_{\{0\}}(\cdot). The authors of zhang2020proximal focus on the problem with an extra box constraint, but their analysis assumes a complementarity property that is not easy to check in advance. The subsequent work zhang2022global relaxes this assumption, but the dependence on other parameters remains unclear.

Therefore, it is natural to ask what are the complexity upper and lower bounds of first-order methods for the non-convex linearly constrained problem (2). To answer this question, we consider the general form problem (1) and utilize the norm of the proximal gradient mapping Lfx𝐩𝐫𝐨𝐱1Lfh(Ab)(x12Lff(x))L_{f}\left\|x-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x-\frac{1}{2L_{f}}\nabla f(x)\right)\right\| as the convergence measure. When hh reduces to an indicator function h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot) which corresponds to the linear equality constrained problem (2), this error measure reduces to the norm of the projected gradient mapping Lfx𝒫{Ax=b}(x12Lff(x))L_{f}\left\|x-\mathcal{P}_{\{Ax=b\}}\left(x-\frac{1}{2L_{f}}\nabla f(x)\right)\right\|. We rigorously define the first-order linear span algorithm class and construct a hard instance to establish a lower bound of Ω(κALfΔ/ϵ2)\Omega\left(\kappa_{A}L_{f}\Delta/\epsilon^{2}\right) where Δ\Delta quantifies the gap in the objective function at the initial point. To demonstrate the tightness of the lower bound, we design an efficient algorithm based on the idea of accelerated proximal point algorithm (APPA), which achieves the bound of O~(κALfΔ/ϵ2)\tilde{O}\left(\kappa_{A}L_{f}\Delta/\epsilon^{2}\right).

The APPA approach solves a non-convex problem by successively solving a sequence of strongly convex subproblems. Hence, the optimality of the complexity for solving problem (1) with strongly convex ff is also crucial. One straightforward approach is to introduce an additional variable y=Axby=Ax-b and apply ADMM to solve the linear constrained problem where one function in the objective is strongly convex and the other is convex. For example, a sublinear convergence rate is obtained in cai2017convergence and lin2015sublinear . The lower bound is examined by constructing a challenging linear equality constrained problem in ouyang2021lower . Assuming a strongly convex parameter μf\mu_{f}, the derived lower bound is Ω(Aλμfϵ)\Omega\left(\frac{\|A\|\|\lambda^{\star}\|}{\mu_{f}\sqrt{\epsilon}}\right) where λ\lambda^{\star} is the optimal dual variable. Remarkably, this dominant term matches the upper bound provided in xu2021iteration up to logarithmic factors.

If we further incorporate the minimal nonzero singular value of AA into the complexity bound, it is possible to “break” the lower bound established in ouyang2021lower . For example, the ADMM and a set of primal-dual methods are proved to have linear convergence rates (lin2015global, ; zhu2022unified, , etc.). Furthermore, for the linear equality constrained problem (2) with strongly convex ff, the algorithms proposed in zhu2022unified and salim2022optimal achieve complexity of O((κA2+κf)log(1/ϵ))O\left((\kappa_{A}^{2}+\kappa_{f})\log(1/\epsilon)\right) and O(κAκflog(1/ϵ))O\left(\kappa_{A}\sqrt{\kappa_{f}}\log(1/\epsilon)\right), respectively. However, salim2022optimal utilizes Chebyshev iteration as a sub-routine, and is hence restricted to the linear equality constrained problems. An Ω(κAκflog(1/ϵ))\Omega\left(\kappa_{A}\sqrt{\kappa_{f}}\log(1/\epsilon)\right) lower bound is also claimed for the linear equality constrained case in salim2022optimal .

For completeness, we also discuss the case with general convex ff, which has been extensively investigated. The authors of ouyang2015accelerated and xu2017accelerated design algorithms that converge to an ϵ\epsilon-optimal solution with a complexity of O(ϵ1)O(\epsilon^{-1}). zhu2022unified proposes a unified framework that covers several well-known primal-dual algorithms and establishes an ergodic convergence rate of O(ϵ1)O(\epsilon^{-1}). The lower bound is provided in ouyang2021lower and the dominant term is Ω(Axλ/ϵ)\Omega\left({\|A\|\|x^{\star}\|\|\lambda^{\star}\|}/{\epsilon}\right) which matches the upper bounds provided in ouyang2015accelerated . Recent researches also establish complexity results better than O(ϵ1)O(\epsilon^{-1}) under stronger assumptions. On the linear constrained problem with O(1)O(1) constraints, xu2022first utilizes the cutting-plane method as the subroutine of ALM and proves a dimension-dependent complexity of O(1/ϵ)O(1/\sqrt{\epsilon}). For a class of bilinear and smooth minimax problems (which include linear constrained problems), song2020breaking proposes an algorithm that achieves O(1/ϵ)O\left(1/\sqrt{\epsilon}\right) complexity, but its dependence on other constants remains suboptimal. Under our setting, we demonstrate that both lower and upper bounds can achieve Θ~(κADL/ϵ)\tilde{\Theta}(\kappa_{A}D\sqrt{L/\epsilon}) by deducing from the strongly convex case.

Contribution. Our results are listed in Table 1 and our main contributions are summarized as follows.

  • Under the assumption of bounded κA\kappa_{A}, we establish the complexity lower bounds for solving (1) with non-convex, strongly convex, and convex ff, within the first-order linear span algorithm class, by constructing three hard linear equality constrained instances.

  • By exploiting the idea of the (accelerated) proximal point algorithm, we design efficient algorithms to solve problem (1) in all three cases.

  • Under the assumption of bounded κA\kappa_{A}, we prove that the complexities of our algorithms match the lower bounds up to logarithmic factors. Therefore, these complexities are optimal.

  • For the special case of the linear equality constrained problem (2), we prove that the full row rank assumption can be removed. The upper bounds remain unchanged except that \kappa_{A} is replaced by \underline{\kappa}_{A}=\|A\|/\underline{\sigma}_{\min}, with \underline{\sigma}_{\min} being the minimum nonzero singular value of A.

Setting Optimality Measure Complexity (upper bound / lower bound)
Strongly Convex \|x-x^{\star}\| \tilde{O}\left(\kappa_{A}\sqrt{\kappa_{f}}\log\left(1/{\epsilon}\right)\right) / \Omega\left(\kappa_{A}\sqrt{\kappa_{f}}\log\left(1/{\epsilon}\right)\right)
Convex f(x)+h_{\rho}(Ax-b)-\min_{x}F(x) \tilde{O}\left(\kappa_{A}\sqrt{L_{f}}D/\sqrt{\epsilon}\right) / \Omega\left(\kappa_{A}\sqrt{L_{f}}D/\sqrt{\epsilon}\right)
Non-convex L_{f}\left\|x-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x-\frac{1}{2L_{f}}\nabla f(x)\right)\right\| \tilde{O}\left(\kappa_{A}L_{f}\Delta/\epsilon^{2}\right) / \Omega\left(\kappa_{A}L_{f}\Delta/\epsilon^{2}\right)
Table 1: Summary of our results; for each setting, the upper bound is listed first and the lower bound second. x^{\star}: the optimal solution, x_{0}\in\mathrm{dom}F: the initial point, h_{\rho}(\cdot): a surrogate function of h(\cdot) (defined in (6)), \mathbf{prox}: the proximal operator, L_{f}: Lipschitz constant of \nabla f(x), \mu_{f}: strong convexity parameter of f(x), \kappa_{f}=L_{f}/\mu_{f}, \kappa_{A}=\|A\|/\mu_{A}, F(x_{0})-F(x^{*})\leq\Delta, \|x_{0}-x^{*}\|\leq D.

Related works. For this paper, there are several closely related works. The first one that we would like to discuss is song2020breaking . It is designed for a class of smooth bilinear minimax problems, which includes convex linearly constrained problems after proper reformulation. With some extra computation, the results of song2020breaking indicate an O(κA(D+D)Lf/ϵ)O\big{(}\kappa_{A}(D+D^{*})\sqrt{L_{f}/\epsilon}\big{)} complexity, where DD^{*} upper bounds the norm of the optimal dual variables. The additional O(κADLf/ϵ)O\big{(}\kappa_{A}D^{*}\sqrt{L_{f}/\epsilon}\big{)} term makes their algorithm suboptimal for our problem class. Second, for linear equality constrained problems, their results require the constraint matrix AA to be full rank, while our result does not require this property. In our paper, we provide a tight lower complexity bound for this problem class and an optimal first-order algorithm that achieves the lower bound.

The second closely related work is salim2022optimal , which derives an O\left(\kappa_{A}\sqrt{\kappa_{f}}\log(1/\epsilon)\right) complexity for the smooth strongly convex linear equality constrained problem (2). Letting P(\cdot) be a properly selected Chebyshev polynomial, they propose an accelerated algorithm with Chebyshev iteration based on the equivalence between Ax=b and \sqrt{P(A^{\top}A)}x=\sqrt{P(A^{\top}A)}x^{*}. However, this technique is restricted to linear equality constrained problems, as the aforementioned equivalence fails for inequality constraints \big{(}h(\cdot)=\mathbb{I}_{\mathbb{R}^{n}_{+}}(\cdot)\big{)} and more general h. Moreover, their algorithm requires the smoothness of the objective function over the whole space, and they do not allow any constraints other than the linear equality ones. Hence their method cannot handle general non-smooth h(\cdot)\neq\mathbb{I}_{\{0\}}(\cdot) by introducing y=Ax-b and solving an equality constrained problem, because the new h(y) term violates both the global smoothness and the joint strong convexity assumptions in their paper. It is worth mentioning that the \Omega\left(\kappa_{A}\sqrt{\kappa_{f}}\log(1/\epsilon)\right) lower bound in salim2022optimal is indeed a valid lower bound for our problem. However, since the construction of our lower bound for the general convex case requires the lower bound for the strongly convex case as an intermediate step, we provide a self-contained lower bound for the strongly convex case in this paper. The hard instance and the structure of the zero chain in our example are both different from those in salim2022optimal .

The last related work that we would like to discuss is sun2019distributed , where an optimal first-order algorithm for non-convex distributed optimization is proposed. Like salim2022optimal , their results require the global smoothness of the objective function and rely on Chebyshev acceleration, and are hence hard to generalize to h(\cdot)\neq\mathbb{I}_{\{0\}}(\cdot). A lower bound for non-convex distributed optimization is also proposed in sun2019distributed . However, it is not clear how their lower bound can be reduced to our case, and both their lower and upper bounds include an unusual dependence on the squared norm of the initial gradient. Therefore, instead of trying to reduce from their results, we construct a new hard instance that lower bounds the complexity of making the proximal gradient mapping small.

Organization.  Section 2 reviews basic facts and defines the quantities used to measure the suboptimality of an approximate solution. In Section 3, we discuss the upper bounds for each of the three cases. For the sake of readability, we present an efficient algorithm for the strongly convex case first. Then, we utilize this algorithm to solve the proximal point subproblems of the non-convex case. We also utilize this algorithm to solve the general convex case by adding an O(\epsilon)-strongly convex perturbation and obtain a corresponding upper bound. Thus, we follow the order from strongly convex to non-convex to convex in the discussion. Then in Section 4, we present formal definitions of the problem classes and algorithm classes and provide the corresponding lower bounds. Finally, in Section 5, we conduct a detailed analysis of the linear equality constrained problem. In particular, we relax the full rank assumption on the matrix A. A direct acceleration for the general convex case, instead of adding strongly convex perturbations, is also presented in the last section.

Notations. We denote the Euclidean inner product by ,\langle\cdot,\cdot\rangle and the Euclidean norm by \|\cdot\|. Let f(,):m×nf(\cdot,\cdot):\mathbb{R}^{m}\times\mathbb{R}^{n}\mapsto\mathbb{R} be a differentiable function of two variables. To denote the partial gradient of ff with respect to the first (or second) variable at the point (x,y)(x,y), we use xf(x,y)\nabla_{x}f(x,y) (or yf(x,y)\nabla_{y}f(x,y)). The full gradient at (x,y)(x,y) is denoted as f(x,y)\nabla f(x,y), where f(x,y)=(xf(x,y),yf(x,y))\nabla f(x,y)=(\nabla_{x}f(x,y),\nabla_{y}f(x,y)). Suppose \mathcal{M} is a closed convex set, we use 𝒫()\mathcal{P}_{\mathcal{M}}(\cdot) to represent the projection operator onto \mathcal{M}. For any matrix Am×nA\in\mathbb{R}^{m\times n}, we use (A)\mathcal{R}(A) to denote its range space. For any vector xnx\in\mathbb{R}^{n}, we use supp{x}\operatorname{supp}\left\{x\right\} to denote the support of xx, i.e., supp{x}:={i[n]xi0}\operatorname{supp}\left\{x\right\}\mathrel{\mathop{:}}=\left\{i\in[n]\mid x_{i}\neq 0\right\}, where [n]={1,,n}[n]=\{1,\cdots,n\}. The indicator function for a set 𝒮\mathcal{S} is defined as 𝕀𝒮(x)=0\mathbb{I}_{\mathcal{S}}(x)=0 if x𝒮x\in\mathcal{S} and 𝕀𝒮(x)=+\mathbb{I}_{\mathcal{S}}(x)=+\infty if x𝒮x\notin\mathcal{S}. The identity matrix in d×d\mathbb{R}^{d\times d} is denoted as IdI_{d}. The positive part of a real number is represented as []+[\cdot]_{+}, defined as [x]+=max{x,0}[x]_{+}=\max\{x,0\}.

2 Preliminaries

2.1 Basic facts and definitions

Lipschitz Smoothness. For a differentiable function f(x)f(x), we say it is LfL_{f}-smooth if

\|\nabla f(x)-\nabla f(x^{\prime})\|\leq L_{f}\|x-x^{\prime}\|,\quad\forall x,x^{\prime}.

Strong Convexity. For a positive constant μf>0\mu_{f}>0, we say f(x)f(x) is μf\mu_{f}-strongly convex if f(x)μf2x2f(x)-\frac{\mu_{f}}{2}\|x\|^{2} is convex, and it is μf\mu_{f}-strongly concave if f(x)-f(x) is strongly convex.

Conjugate Function. The conjugate function of a function f()f(\cdot) is defined as

f(y)=supx{xTyf(x)}.f^{\star}(y)=\sup_{x}\{x^{\mathrm{T}}y-f(x)\}.

It is well-known that ff^{\star} is convex. Furthermore, when ff is assumed to be strongly convex or smooth, its conjugate function ff^{\star} has the following properties hiriart1996convex .

Lemma 1

If ff is μf\mu_{f}-strongly convex, then its conjugate ff^{\star} is 1μf\frac{1}{\mu_{f}}-smooth. If ff is LfL_{f}-smooth, then its conjugate ff^{\star} is 1Lf\frac{1}{L_{f}}-strongly convex.

Proximal Operator. For a proper closed convex function hh, the corresponding proximal operator is defined as

𝐩𝐫𝐨𝐱h(x):=argminu{h(u)+12xu2}.\mathbf{prox}_{h}(x)\mathrel{\mathop{:}}=\operatorname*{arg\,min}_{u}\left\{h(u)+\frac{1}{2}\|x-u\|^{2}\right\}.

2.2 Suboptimality measure

To study the upper and lower complexity bounds of first-order methods for problem (1), it is essential to define the appropriate suboptimality measures for the strongly convex, convex, and non-convex cases, respectively. In the following sections, we will present these measures one by one.

Strongly convex case.   In this case, the optimal solution is unique. We can define the suboptimality measure for a point x as its squared distance to the optimal solution

SubOpt𝖲𝖢(x):=xx2.\operatorname{SubOpt}_{\mathsf{SC}}(x)\mathrel{\mathop{:}}=\|x-x^{\star}\|^{2}. (5)

Non-convex case.   In the non-convex case, we define the suboptimality measure as

SubOpt𝖭𝖢(x):=Lfx𝐩𝐫𝐨𝐱1Lfh(Ab)(x12Lff(x)),\operatorname{SubOpt}_{\mathsf{NC}}(x)\mathrel{\mathop{:}}=L_{f}\left\|x-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x-\frac{1}{2L_{f}}\nabla f(x)\right)\right\|,

which measures the violation of the first-order optimality condition. It is worth pointing out that the operator 𝐩𝐫𝐨𝐱1Lfh(Ab)\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)} is not needed in the algorithm. We only utilize it as a formal measure but do not need to directly evaluate it in our algorithms.

Convex case.   The convex case is a little bit complicated. The standard suboptimality measure F(x)infxF(x)F(x)-\inf_{x^{\prime}}F(x^{\prime}) may not fit our setting, since h(Axb)=+h(Ax-b)=+\infty could happen when AxbdomhAx-b\not\in\mathrm{dom}h. For example, for linearly constrained problems where h(Axb)=𝕀{Ax=b}(x)h(Ax-b)=\mathbb{I}_{\{Ax=b\}}(x), typically first-order methods only guarantee returning solutions with small constraint violation instead of exact constraint satisfaction. Thus the objective function gap will always be ++\infty. To handle this issue, we propose the following suboptimality measure

SubOpt𝖢(x):=f(x)+hρ(Axb)minx{f(x)+h(Axb)},\operatorname{SubOpt}_{\mathsf{C}}(x)\mathrel{\mathop{:}}=f(x)+h_{\rho}(Ax-b)-\min_{x^{\prime}}\left\{f(x^{\prime})+h(Ax^{\prime}-b)\right\},

where h()h(\cdot) is replaced by the surrogate function hρ()h_{\rho}(\cdot) given by

hρ(z):=supy2ρ{zTyh(y)}.h_{\rho}(z)\mathrel{\mathop{:}}=\sup_{\|y\|_{2}\leq\rho}\{z^{\mathrm{T}}y-h^{\star}(y)\}. (6)

For any \rho\in(0,+\infty), h_{\rho}(x) can be viewed as a Lipschitz approximation of h(x), as the following lemma indicates.

Lemma 2

For any ρ(0,+)\rho\in(0,+\infty), the followings hold

  1. 1.

    hρ(x)h(x)h_{\rho}(x)\leq h(x), and hρ(x)h(x)h_{\rho}(x)\to h(x) as ρ+\rho\to+\infty.

  2. 2.

    hρ(x)h_{\rho}(x) is ρ\rho-Lipschitz continuous, and if h(x)h(x) is ρ\rho-Lipschitz continuous, then hρ(x)h(x)h_{\rho}(x)\equiv h(x).

  3. 3.

    If h(x)=𝕀𝒦(x)h(x)=\mathbb{I}_{\mathcal{K}}(x) where 𝒦\mathcal{K} is a proper cone, we have hρ(x)=ρ𝒫𝒦(x)h_{\rho}(x)=\rho\|\mathcal{P}_{\mathcal{K}^{\circ}}(x)\|.

Proof

Part 1. h_{\rho}(x)=\sup_{\|y\|\leq\rho}\{x^{\mathrm{T}}y-h^{\star}(y)\}\leq\sup_{y}\{x^{\mathrm{T}}y-h^{\star}(y)\}=h^{\star\star}(x)=h(x), where the last equality uses the fact that h is proper, closed and convex. Moreover, h_{\rho}(x) is nondecreasing in \rho and \sup_{\rho>0}h_{\rho}(x)=h^{\star\star}(x)=h(x), so h_{\rho}(x)\to h(x) as \rho\to+\infty.

Part 2. For any x,xx,x^{\prime}, we have

hρ(x)\displaystyle h_{\rho}(x) =supyρ{yT(xx)+yTxh(y)}\displaystyle=\sup_{\|y\|\leq\rho}\{y^{\mathrm{T}}(x-x^{\prime})+y^{\mathrm{T}}x^{\prime}-h^{\star}(y)\}
supyρyT(xx)+supyρ{yTxh(y)}\displaystyle\leq\sup_{\|y\|\leq\rho}y^{\mathrm{T}}(x-x^{\prime})+\sup_{\|y\|\leq\rho}\{y^{\mathrm{T}}x^{\prime}-h^{\star}(y)\}
ρxx+hρ(x),\displaystyle\leq\rho\|x-x^{\prime}\|+h_{\rho}(x^{\prime}),

which implies that hρ(x)h_{\rho}(x) is ρ\rho-Lipschitz continuous.

For any xx, let y(x)=argmaxy{xTyh(y)}y^{\star}(x)=\operatorname*{arg\,max}_{y}\{x^{\mathrm{T}}y-h^{\star}(y)\}, then we have y(x)h(x)y^{\star}(x)\in\partial h(x). It holds y(x)ρ\|y^{\star}(x)\|\leq\rho if h(x)h(x) is ρ\rho-Lipschitz continuous. Hence, we have hρ(x)h(x)h_{\rho}(x)\equiv h(x).

Part 3. When h(x)=𝕀𝒦(x)h(x)=\mathbb{I}_{\mathcal{K}}(x), the conjugate function is h(y)=𝕀𝒦(y)h^{\star}(y)=\mathbb{I}_{\mathcal{K}^{\circ}}(y) where 𝒦\mathcal{K}^{\circ} is the polar cone of 𝒦\mathcal{K}. Therefore, we have

hρ(x)=supyρ,y𝒦xTy=xT(ρ𝒫𝒦(x)𝒫𝒦(x))=ρ𝒫𝒦(x).h_{\rho}(x)=\sup\limits_{\|y\|\leq\rho,y\in\mathcal{K}^{\circ}}x^{\mathrm{T}}y=x^{\mathrm{T}}\left(\frac{\rho}{\|\mathcal{P}_{\mathcal{K}^{\circ}}(x)\|}\mathcal{P}_{\mathcal{K}^{\circ}}(x)\right)=\rho\|\mathcal{P}_{\mathcal{K}^{\circ}}(x)\|.

Remark 1

For norm regularized problems (4) where h(x)=xh(x)=\|x\|_{*} is an arbitrary norm, let \left\|\cdot\right\|^{*} be its dual norm, then the conjugate function h(y)=𝕀y1(y)h^{\star}(y)=\mathbb{I}_{\left\|y\right\|^{*}\leq 1}(y). As long as ρxx,x\rho\left\|x\right\|^{*}\geq\left\|x\right\|,\forall x, it holds that

hρ(x)=supyρ,y1xTy=supy1xTy=x=h(x).h_{\rho}(x)=\sup\limits_{\|y\|\leq\rho,\left\|y\right\|^{*}\leq 1}x^{\mathrm{T}}y=\sup\limits_{\left\|y\right\|^{*}\leq 1}x^{\mathrm{T}}y=\left\|x\right\|_{*}=h(x).

In particular, when h()=h(\cdot)=\|\cdot\|, it is sufficient to take ρ1\rho\geq 1.

Remark 2

For linear inequality constrained problems with h(x)=\mathbb{I}_{\{x\leq 0\}}, Lemma 2 indicates that h_{\rho}(Ax-b)=\rho\left\|[Ax-b]_{+}\right\|. Therefore, we have

SubOpt𝖢(x)=f(x)f(x)+ρ[Axb]+,\displaystyle\operatorname{SubOpt}_{\mathsf{C}}(x)=f(x)-f(x^{\star})+\rho\left\|[Ax-b]_{+}\right\|,

Let xx^{\star} and λ\lambda^{\star} be the optimal primal and dual solutions respectively, it is a standard lemma that (see e.g. (zhu2022unified, , Lemma 3)) when ρ2λ\rho\geq 2\left\|\lambda^{\star}\right\|, we have

max{|f(x)f(x)|,ρ[Axb]+}2SubOpt𝖢(x).\displaystyle\max\left\{\left|f(x)-f(x^{\star})\right|,\rho\left\|[Ax-b]_{+}\right\|\right\}\leq 2\operatorname{SubOpt}_{\mathsf{C}}(x).

Similarly, for linear equality constrained problems with h(x)=\mathbb{I}_{\{0\}}(x), we have h_{\rho}(Ax-b)=\rho\|Ax-b\|. Then \max\left\{\left|f(x)-f(x^{\star})\right|,\rho\left\|Ax-b\right\|\right\}\leq 2\operatorname{SubOpt}_{\mathsf{C}}(x), provided that \rho\geq\|\lambda^{\star}\|. Therefore, \operatorname{SubOpt}_{\mathsf{C}}(x) essentially agrees with the widely used suboptimality measure for convex linearly constrained problems, see e.g. xu2021first ; zhu2022unified ; hamedani2021primal . Furthermore, we can also cover the problem formulation with both equality and inequality constraints studied in zhang2022global ; zhang2020proximal by letting h(\cdot)=\mathbb{I}_{\{0\}\times\{x\leq 0\}}(\cdot).
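
To make the surrogate concrete, here is a tiny numerical check of the equality constrained case h=\mathbb{I}_{\{0\}} (so h^{\star}\equiv 0 and h_{\rho}(z)=\rho\|z\| by Lemma 2); the script below is only an illustration and all names in it are ours.

import numpy as np

rho = 2.0
z = np.array([3.0, -4.0])
# definition (6): sup over ||y|| <= rho of z^T y - h*(y), with h* = 0,
# is attained at y = rho * z / ||z||, giving rho * ||z||.
y_star = rho * z / np.linalg.norm(z)
print(z @ y_star, rho * np.linalg.norm(z))   # both equal 10.0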

2.3 Nesterov’s accelerated gradient descent

In this paper, we will frequently use Nesterov’s accelerated gradient descent (AGD) as a subroutine, as stated in Algorithm 1. It is the optimal first-order algorithm for the smooth strongly convex optimization problems nesterov2018lectures .

1 Input: objective function Ψ\Psi, Lipschitz constant LL, strong convexity parameter μ\mu, initial point y0y_{0} and tolerance δ>0\delta>0.
2 Initialize: κLμ,θκ1κ+1,y1y0,k0\kappa\leftarrow\frac{L}{\mu},\theta\leftarrow\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1},y_{-1}\leftarrow y_{0},k\leftarrow 0.
3 repeat
4       y~k+1yk+θ(ykyk1)\tilde{y}_{k+1}\leftarrow y_{k}+\theta(y_{k}-y_{k-1}).
5       yk+1y~k+11LΨ(y~k+1)y_{k+1}\leftarrow\tilde{y}_{k+1}-\frac{1}{L}\nabla\Psi(\tilde{y}_{k+1}).
6       kk+1k\leftarrow k+1.
7until Ψ(yT)μδ\left\|\nabla\Psi(y_{T})\right\|\leq\mu\delta;
Output: yTy_{T}.
Algorithm 1 𝖠𝖦𝖣(Ψ,y0,δ)\mathsf{AGD}(\Psi,y_{0},\delta)

The following proposition is a simple corollary of (nesterov2018lectures, , Theorem 2.2.2).

Proposition 1

Assume that Ψ\Psi is LL-smooth and μ\mu-strongly convex. Denote κ=Lμ\kappa=\frac{L}{\mu} and y=argminyΨ(y)y^{\star}=\operatorname*{arg\,min}_{y}\Psi(y). Algorithm 1 𝖠𝖦𝖣(Ψ,y0,δ)\mathsf{AGD}(\Psi,y_{0},\delta) terminates in

T2κlog(2κy0yδ)\displaystyle T\leq 2\sqrt{\kappa}\log\left(\frac{2\kappa\left\|y_{0}-y^{\star}\right\|}{\delta}\right)

iterations, and the output yTy_{T} satisfies yTyδ\left\|y_{T}-y^{\star}\right\|\leq\delta.
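
As a reference point, the following is a minimal NumPy sketch of Algorithm 1; grad_Psi is an assumed gradient oracle for \Psi, and the function name, the max_iter safeguard and the stopping test are illustrative rather than part of the paper.

import numpy as np

def agd(grad_Psi, L, mu, y0, delta, max_iter=100000):
    # Nesterov's AGD (Algorithm 1). Stopping once ||grad Psi(y)|| <= mu * delta
    # guarantees ||y - y*|| <= delta by mu-strong convexity.
    kappa = L / mu
    theta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    y_prev, y = y0.copy(), y0.copy()
    for _ in range(max_iter):
        y_tilde = y + theta * (y - y_prev)                 # extrapolation
        y_prev, y = y, y_tilde - grad_Psi(y_tilde) / L     # gradient step at y_tilde
        if np.linalg.norm(grad_Psi(y)) <= mu * delta:
            break
    return y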

3 Upper bounds

3.1 Strongly convex case: Intuition

Though 𝐩𝐫𝐨𝐱h(x)\mathbf{prox}_{h}(x) is easy to evaluate, h(Axb)h(Ax-b) may not have efficient proximal mapping. Therefore, by utilizing the conjugate function of hh, we decouple the composite structure by reformulating (1) as a convex-concave saddle point problem

minxmaxλ(x,λ):=f(x)+λT(Axb)h(λ).\min_{x}\max_{\lambda}\ \mathcal{L}(x,\lambda)\mathrel{\mathop{:}}=f(x)+\lambda^{\mathrm{T}}(Ax-b)-h^{\star}(\lambda). (7)

Switching the order of min\min and max\max, and minimizing with respect to xx, we obtain the dual problem of (1):

maxλΦ(λ):=f(ATλ)bTλh(λ).\max_{\lambda}\ \Phi(\lambda)\mathrel{\mathop{:}}=-f^{\star}(-A^{\mathrm{T}}\lambda)-b^{\mathrm{T}}\lambda-h^{\star}(\lambda). (8)

The following lemma illustrates that the dual problem is strongly concave.

Lemma 3

Φ(λ)\Phi(\lambda) is μΦ\mu_{\Phi}-strongly concave with μΦ:=μA2/Lf\mu_{\Phi}\mathrel{\mathop{:}}=\mu_{A}^{2}/L_{f}.

Proof

Note that f(x)f(x) is LfL_{f}-smooth and μf\mu_{f}-strongly convex, Lemma 1 indicates that f()f^{\star}(\cdot) is 1μf\frac{1}{\mu_{f}}-smooth and 1Lf\frac{1}{L_{f}}-strongly convex. Denote f~(λ):=f(ATλ)\tilde{f}(\lambda)\mathrel{\mathop{:}}=f^{\star}(-A^{\mathrm{T}}\lambda), then for any λ,λm\lambda,\lambda^{\prime}\in\mathbb{R}^{m}, we have

f~(λ)f~(λ)\displaystyle\tilde{f}(\lambda^{\prime})-\tilde{f}(\lambda) f(ATλ),ATλ+ATλ+12LfATλATλ2\displaystyle\geq\langle\nabla f^{\star}(-A^{\mathrm{T}}\lambda),-A^{\mathrm{T}}\lambda^{\prime}+A^{\mathrm{T}}\lambda\rangle+\frac{1}{2L_{f}}\|A^{\mathrm{T}}\lambda-A^{\mathrm{T}}\lambda^{\prime}\|^{2}
=f~(λ),λλ+12LfATλATλ2\displaystyle=\langle\nabla\tilde{f}(\lambda),\lambda^{\prime}-\lambda\rangle+\frac{1}{2L_{f}}\|A^{\mathrm{T}}\lambda-A^{\mathrm{T}}\lambda^{\prime}\|^{2}
f~(λ),λλ+μA22Lfλλ2,\displaystyle\geq\langle\nabla\tilde{f}(\lambda),\lambda^{\prime}-\lambda\rangle+\frac{\mu_{A}^{2}}{2L_{f}}\|\lambda-\lambda^{\prime}\|^{2},

which implies that f~(λ)\tilde{f}(\lambda) is (μA2/Lf)(\mu_{A}^{2}/L_{f})-strongly convex. Combining the fact that bTλ+h(λ)b^{\mathrm{T}}\lambda+h^{\star}(\lambda) is convex, we complete the proof. ∎

One can observe that the Lipschitz continuity of f(x)\nabla f(x) is transferred to the strong concavity of Φ(λ)\Phi(\lambda) through the matrix AA. Therefore, a linear convergence can be expected. To exploit this observation, we perform an inexact proximal point algorithm to solve the dual problem (8):

λkargmaxλΦk(λ):=Φ(λ)2λλk12,\lambda_{k}\approx\operatorname*{arg\,max}_{\lambda}\ \Phi_{k}(\lambda)\mathrel{\mathop{:}}=\Phi(\lambda)-\frac{\ell}{2}\|\lambda-\lambda_{k-1}\|^{2}, (9)

where λk1\lambda_{k-1} is the iterate of the (k1)(k-1)-th step. Now it remains to solve the subproblem (9) efficiently. By expanding the term f(ATλ)f^{\star}(-A^{\mathrm{T}}\lambda) through the conjugate function again, we can rewrite (9) into the equivalent saddle point problem

maxλminxk(x,λ):=f(x)+λT(Axb)h(λ)2λλk12.\max_{\lambda}\min_{x}\ {\mathcal{L}}_{k}(x,\lambda)\mathrel{\mathop{:}}=f(x)+\lambda^{\mathrm{T}}(Ax-b)-h^{\star}(\lambda)-\frac{\ell}{2}\|\lambda-\lambda_{k-1}\|^{2}. (10)

Let gk(λ):=h(λ)+2λλk12g_{k}(\lambda)\mathrel{\mathop{:}}=h^{\star}(\lambda)+\frac{\ell}{2}\|\lambda-\lambda_{k-1}\|^{2}. Then the dual problem of (9) is given by

minxΨk(x):=f(x)+gk(Axb).\min_{x}\ \Psi_{k}(x)\mathrel{\mathop{:}}=f(x)+g_{k}^{\star}(Ax-b). (11)

The function Ψk()\Psi_{k}(\cdot) is LΨL_{\Psi}-smooth and μf\mu_{f}-strongly convex, with LΨ=Lf+LA2L_{\Psi}=L_{f}+\frac{L_{A}^{2}}{\ell}. This time, the strong convexity induced by the proximal term for the dual variable λ\lambda is transferred to the Lipschitz smoothness in the primal variable xx. With a few more computations, we show that Ψk\nabla\Psi_{k} can be easily evaluated, and hence AGD can be applied to solve (11).

Proposition 2

The gradient Ψk()\nabla\Psi_{k}(\cdot) can be evaluated with one call of AA, one call of ATA^{\mathrm{T}}, one call of 𝐩𝐫𝐨𝐱h()\mathbf{prox}_{\ell h}(\cdot), and one call of f()\nabla f(\cdot), respectively.

Proof

With λk(x):=argmaxλk(x,λ)\lambda_{k}^{\star}(x)\mathrel{\mathop{:}}=\operatorname*{arg\,max}_{\lambda}\mathcal{L}_{k}(x,\lambda), Danskin’s theorem indicates that

Ψk(x)=xk(x,λk(x))=f(x)+ATλk(x).\nabla\Psi_{k}(x)=\nabla_{x}\mathcal{L}_{k}(x,\lambda_{k}^{\star}(x))=\nabla f(x)+A^{\mathrm{T}}\lambda_{k}^{\star}(x).

In fact, λk(x)\lambda_{k}^{\star}(x) can be explicitly written as

\begin{split}\lambda_{k}^{\star}(x)&=\operatorname*{arg\,max}_{\lambda}\ \left\{-h^{\star}(\lambda)-{\frac{\ell}{2}}\left\|\lambda-\lambda_{k-1}-\frac{Ax-b}{\ell}\right\|^{2}\right\}\\ &=\mathbf{prox}_{\frac{h^{\star}}{\ell}}\left(\lambda_{k-1}+\frac{Ax-b}{\ell}\right)\\ &=\lambda_{k-1}+\frac{Ax-b}{\ell}-\frac{1}{\ell}\mathbf{prox}_{{\ell}h}\left({\ell}\lambda_{k-1}+Ax-b\right),\end{split} (12)

where the last equality comes from Moreau’s decomposition theorem (showalter1997monotone, , Proposition IV.1.8). ∎

Therefore, Ψk(x)\nabla\Psi_{k}(x) can be easily evaluated and we can apply Algorithm 1 to efficiently obtain xkargminxΨk(x)x_{k}\approx\operatorname*{arg\,min}_{x}\Psi_{k}(x) and then update λk=λk(xk)\lambda_{k}=\lambda_{k}^{\star}(x_{k}) by (12), which will be proved to be an approximate solution of the subproblem (9).
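
A minimal sketch of this gradient evaluation, assuming a user-supplied proximal operator prox_lh for \ell h and a gradient oracle grad_f; all function names are illustrative.

import numpy as np

def lambda_star(x, lam_prev, ell, A, b, prox_lh):
    # lambda_k^*(x) = prox_{h^*/ell}(lam_prev + (Ax - b)/ell), computed from
    # prox_{ell h} via Moreau's decomposition as in (12).
    u = ell * lam_prev + A @ x - b
    return (u - prox_lh(u)) / ell

def grad_Psi_k(x, lam_prev, ell, A, b, grad_f, prox_lh):
    # grad Psi_k(x) = grad f(x) + A^T lambda_k^*(x): one call each of A, A^T,
    # prox_{ell h} and grad f, as stated in Proposition 2.
    return grad_f(x) + A.T @ lambda_star(x, lam_prev, ell, A, b, prox_lh)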

3.2 Strongly convex case: Analysis

Now, we are ready to give the complete algorithm for strongly convex problems.

1 Input: initial point x0,λ0x_{0},\lambda_{0}, smoothness parameter LfL_{f}, strong convexity parameter μf\mu_{f}, minimal singular value μA\mu_{A}, regularization parameter μΦ\ell\geq\mu_{\Phi}, radius parameter DD.
2 Initialize: ρ=μΦ12,δk=(1ρ)k2D\rho=\frac{\mu_{\Phi}}{12\ell},\delta_{k}=(1-\rho)^{\frac{k}{2}}D.
3 for k=1,,Tk=1,\dots,T do
4       Update xk𝖠𝖦𝖣(Ψk,xk1,δk)x_{k}\leftarrow\mathsf{AGD}(\Psi_{k},x_{k-1},\delta_{k}).
5      Update λk\lambda_{k} by (12): λk𝐩𝐫𝐨𝐱h(λk1+Axkb).\lambda_{k}\leftarrow\mathbf{prox}_{\frac{h^{\star}}{\ell}}\left(\lambda_{k-1}+\frac{Ax_{k}-b}{\ell}\right).
Output: xTx_{T}.
Algorithm 2 Inexact PPA for strongly convex problems
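
Below is a compact sketch of Algorithm 2, built on the agd, lambda_star and grad_Psi_k sketches above, with the choice \ell=\mu_{\Phi} (so \rho=1/12); the solver interface and all names are ours, not the paper's.

import numpy as np

def inexact_ppa_sc(x0, lam0, grad_f, prox_lh, A, b, Lf, mu_f, mu_A, D, T):
    # Algorithm 2: inexact PPA on the dual; each subproblem min_x Psi_k(x)
    # is solved by AGD, followed by the dual update (12).
    ell = mu_A**2 / Lf                                  # ell = mu_Phi
    rho = 1.0 / 12.0                                    # rho = mu_Phi / (12 ell)
    L_Psi = Lf + np.linalg.norm(A, 2)**2 / ell          # smoothness of Psi_k
    x, lam = x0.copy(), lam0.copy()
    for k in range(1, T + 1):
        delta_k = (1.0 - rho)**(k / 2.0) * D
        g = lambda z, lam=lam: grad_Psi_k(z, lam, ell, A, b, grad_f, prox_lh)
        x = agd(g, L_Psi, mu_f, x, delta_k)             # x_k ~ argmin Psi_k
        lam = lambda_star(x, lam, ell, A, b, prox_lh)   # lambda_k = lambda_k^*(x_k)
    return x
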
Proposition 3

Suppose that f(x) is L_{f}-smooth and \mu_{f}-strongly convex, the matrix A satisfies \left\|A\right\|\leq L_{A}, and the minimum singular value of A is no smaller than \mu_{A}. Let (x^{\star},\lambda^{\star}) be the pair of optimal primal and dual variables, and let \{(x_{k},\lambda_{k})\} be the iterate sequence generated by Algorithm 2. Assume that \ell\geq\mu_{\Phi}. Then for any 0\leq k\leq T, it holds that

\left\|\lambda_{k}-\lambda^{\star}\right\|\leq(1-\rho)^{k/2}\cdot M, (13)

where M=max{λ0λ,10LAμΦ1D}M=\max\left\{\left\|\lambda_{0}-\lambda^{\star}\right\|,10L_{A}\mu_{\Phi}^{-1}D\right\}. If we denote xk=argminxΨk(x)x_{k}^{\star}=\operatorname*{arg\,min}_{x}\Psi_{k}(x), then for any 2kT2\leq k\leq T, it holds that

xk1xk\displaystyle\left\|x_{k-1}-x^{\star}_{k}\right\|\leq LAμfλk1λk2+δk1,\displaystyle~{}\frac{L_{A}}{\mu_{f}}\left\|\lambda_{k-1}-\lambda_{k-2}\right\|+\delta_{k-1}, (14)
xkx\displaystyle\left\|x_{k}-x^{\star}\right\|\leq LAμfλk1λ+δk.\displaystyle~{}\frac{L_{A}}{\mu_{f}}\left\|\lambda_{k-1}-\lambda^{\star}\right\|+\delta_{k}. (15)
Proof

Let λk=argmaxλΦk(λ)\lambda^{\star}_{k}=\operatorname*{arg\,max}_{\lambda}\Phi_{k}(\lambda) and xk=argminxΨk(x)x_{k}^{\star}=\operatorname*{arg\,min}_{x}\Psi_{k}(x). Recall that the output xkx_{k} of Algorithm 𝖠𝖦𝖣(Ψk,xk1,δk)\mathsf{AGD}(\Psi_{k},x_{k-1},\delta_{k}) guarantees that xkxkδk\left\|x_{k}-x^{\star}_{k}\right\|\leq\delta_{k}. According to the dual update rule (12) and the optimality of (xk,λk)(x^{\star}_{k},\lambda^{\star}_{k}), we have

λk=𝐩𝐫𝐨𝐱h(λk1+Axkb),λk=𝐩𝐫𝐨𝐱h(λk1+Axkb).\lambda_{k}^{\star}=\mathbf{prox}_{\frac{h^{\star}}{\ell}}\left(\lambda_{k-1}+\frac{Ax_{k}^{\star}-b}{\ell}\right),\quad\lambda_{k}=\mathbf{prox}_{\frac{h^{\star}}{\ell}}\left(\lambda_{k-1}+\frac{Ax_{k}-b}{\ell}\right).

Thus, we can deduce λkλkAxkAxkLAδk\left\|\lambda_{k}-\lambda^{\star}_{k}\right\|\leq\left\|\frac{Ax_{k}-Ax^{\star}_{k}}{\ell}\right\|\leq\frac{L_{A}\delta_{k}}{\ell} due to the non-expansiveness of proximal operator. By the definition of λk\lambda^{\star}_{k}, it holds that

0Φk(λk)=Φ(λk)+(λkλk1),\displaystyle 0\in\partial\Phi_{k}(\lambda^{\star}_{k})=\partial\Phi(\lambda^{\star}_{k})+\ell(\lambda^{\star}_{k}-\lambda_{k-1}),

which implies that there exists ΦkΦ(λk)\Phi^{\prime}_{k}\in\partial\Phi(\lambda^{\star}_{k}) satisfying Φk+(λkλk1)=0\Phi^{\prime}_{k}+\ell(\lambda^{\star}_{k}-\lambda_{k-1})=0. Combining the fact that Φ\Phi is μΦ\mu_{\Phi}-strongly concave and λ=argmaxλΦ(λ)\lambda^{\star}=\operatorname*{arg\,max}_{\lambda}\Phi(\lambda), we have

μΦλkλ2\displaystyle\mu_{\Phi}\left\|\lambda^{\star}_{k}-\lambda^{\star}\right\|^{2}\leq Φk,λkλ=λk1λk,λkλ\displaystyle~{}\left\langle{\Phi^{\prime}_{k}},{\lambda^{\star}_{k}-\lambda^{\star}}\right\rangle=\ell\left\langle{\lambda_{k-1}-\lambda^{\star}_{k}},{\lambda^{\star}_{k}-\lambda^{\star}}\right\rangle
=\displaystyle= 2(λk1λ2λkλ2λkλk12).\displaystyle~{}\frac{\ell}{2}\left(\left\|\lambda_{k-1}-\lambda^{\star}\right\|^{2}-\left\|\lambda^{\star}_{k}-\lambda^{\star}\right\|^{2}-\left\|\lambda^{\star}_{k}-\lambda_{k-1}\right\|^{2}\right).

Hence, for any constant c>0c>0, we have

+2μΦλk1λ2λkλ2(1+c)1λkλ2c1λkλk2.\displaystyle\frac{\ell}{\ell+2\mu_{\Phi}}\left\|\lambda_{k-1}-\lambda^{\star}\right\|^{2}\geq\left\|\lambda^{\star}_{k}-\lambda^{\star}\right\|^{2}\geq(1+c)^{-1}\left\|\lambda_{k}-\lambda^{\star}\right\|^{2}-c^{-1}\left\|\lambda^{\star}_{k}-\lambda_{k}\right\|^{2}.

Picking c=μΦ6c=\frac{\mu_{\Phi}}{6\ell} and utilizing the assumption that μΦ\ell\geq\mu_{\Phi} yields that

λkλ2(1μΦ6)λk1λ2+7μΦ(C1δk)2,\displaystyle\left\|\lambda_{k}-\lambda^{\star}\right\|^{2}\leq\left(1-\frac{\mu_{\Phi}}{6\ell}\right)\left\|\lambda_{k-1}-\lambda^{\star}\right\|^{2}+\frac{7\ell}{\mu_{\Phi}}(C_{1}\delta_{k})^{2},

where we denote C1=LA/C_{1}=L_{A}/\ell. Therefore, by recursively applying the inequality above, we have

λkλ2(12ρ)kλ0λ2+712ρt=1k(12ρ)kt(C1δt)2.\left\|\lambda_{k}-\lambda^{\star}\right\|^{2}\leq(1-2\rho)^{k}\left\|\lambda_{0}-\lambda^{\star}\right\|^{2}+\frac{7}{12\rho}\sum_{t=1}^{k}(1-2\rho)^{k-t}(C_{1}\delta_{t})^{2}.

By our choice of δt=(1ρ)t/2D\delta_{t}=(1-\rho)^{t/2}D, it holds

t=1k(12ρ)kt(δt)2[(1ρ)k(12ρ)k]D2ρ.\displaystyle\sum_{t=1}^{k}(1-2\rho)^{k-t}(\delta_{t})^{2}\leq\left[(1-\rho)^{k}-(1-2\rho)^{k}\right]\cdot\frac{D^{2}}{\rho}.

Putting these pieces together and plugging in the definition of C1C_{1} and ρ\rho, we get (13).

Recall the definition of \Psi_{k} given in (11). We can rewrite it as

Ψk(x)=f(x)+h^(Axb+λk1)2λk12,\displaystyle\Psi_{k}(x)=f(x)+\hat{h}(Ax-b+\ell\lambda_{k-1})-\frac{\ell}{2}\|\lambda_{k-1}\|^{2},

where h^\hat{h} is defined by

h^(u)=maxλ(λTuh(λ)2λ2).\displaystyle\hat{h}(u)=\max_{\lambda}\left(\lambda^{\mathrm{T}}u-h^{\star}(\lambda)-\frac{\ell}{2}\left\|\lambda\right\|^{2}\right).

Note that h^(u)\hat{h}(u) is the conjugate function of h(λ)+2λ2h^{\star}(\lambda)+\frac{\ell}{2}\|\lambda\|^{2}, and hence it follows from Lemma 1 that h^(u)\hat{h}(u) is 1\frac{1}{\ell}-smooth. By definition, we know Ψk(xk)=Ψk1(xk1)\nabla\Psi_{k}(x^{\star}_{k})=\nabla\Psi_{k-1}(x^{\star}_{k-1}), and hence

μfxkxk1\displaystyle\mu_{f}\left\|x^{\star}_{k}-x^{\star}_{k-1}\right\|\leq Ψk(xk)Ψk(xk1)=Ψk1(xk1)Ψk(xk1)\displaystyle~{}\left\|\nabla\Psi_{k}(x^{\star}_{k})-\nabla\Psi_{k}(x^{\star}_{k-1})\right\|=\left\|\nabla\Psi_{k-1}(x^{\star}_{k-1})-\nabla\Psi_{k}(x^{\star}_{k-1})\right\|
=\displaystyle= ATh^(Axk1b+λk2)ATh^(Axk1b+λk1)\displaystyle~{}\left\|A^{\mathrm{T}}\nabla\hat{h}(Ax^{\star}_{k-1}-b+\ell\lambda_{k-2})\!-\!A^{\mathrm{T}}\nabla\hat{h}(Ax^{\star}_{k-1}-b+\ell\lambda_{k-1})\right\|
\displaystyle\leq A1λk2λk1=Aλk2λk1,\displaystyle~{}\left\|A\right\|\cdot\frac{1}{\ell}\cdot\ell\left\|\lambda_{k-2}-\lambda_{k-1}\right\|=\left\|A\right\|\left\|\lambda_{k-2}-\lambda_{k-1}\right\|,

where the first inequality follows from the μf\mu_{f}-strong-convexity of ff, and the last inequality holds because h^\hat{h} is 1\frac{1}{\ell}-smooth. Combining the fact that xk1xkδk+xk1xk\left\|x_{k-1}-x^{\star}_{k}\right\|\leq\delta_{k}+\left\|x^{\star}_{k-1}-x^{\star}_{k}\right\|, we get (14).

Similarly, by the optimality of (x,λ)(x^{\star},\lambda^{\star}), we have xx^{\star} is the minimum of

Ψ(x):=f(x)+h^(Axb+λ)2λ2.\displaystyle\Psi_{\star}(x)\mathrel{\mathop{:}}=f(x)+\hat{h}(Ax-b+\ell\lambda^{\star})-\frac{\ell}{2}\|\lambda^{\star}\|^{2}.

Repeating the argument above yields (15). ∎

Theorem 3.1

Under the same assumptions as in Proposition 3 and \left\|x_{0}-x^{\star}\right\|\leq D, the number of inner iterations T_{k} of \mathsf{AGD}(\Psi_{k},x_{k-1},\delta_{k}) in each outer iteration of Algorithm 2 can be upper bounded by

Tk8Lf+1LA2μflog(10κfκADD),\displaystyle T_{k}\leq 8\sqrt{\frac{L_{f}+\ell^{-1}L_{A}^{2}}{\mu_{f}}}\log\left(\frac{10\kappa_{f}\kappa_{A}D_{\star}}{D}\right),

where D:=x0x+LALfλ0λD_{\star}\mathrel{\mathop{:}}=\left\|x_{0}-x^{\star}\right\|+\frac{L_{A}}{L_{f}}\left\|\lambda_{0}-\lambda^{\star}\right\|. Furthermore, for Algorithm 2 to find an approximate solution xTx_{T} satisfying xTxϵ\left\|x_{T}-x^{\star}\right\|\leq\epsilon, the number of outer iterations is

T12μΦlog(100κfκADϵ).\displaystyle T\leq\frac{12\ell}{\mu_{\Phi}}\log\left(100\kappa_{f}\kappa_{A}\cdot\frac{D}{\epsilon}\right).

One comment here is that although we assume knowledge of an upper bound D on the distance from x_{0} to the optimal solution x^{*}, this bound only appears in the logarithmic terms. Hence one can use a very loose overestimate of the distance in the algorithm without deteriorating the performance.

Proof

We first consider the case k\geq 2. By combining (14) and (13), we have

xk1xkLAμf2(1ρ)k/21M+δk1(1ρ)k/2(2D+3LAμfM).\displaystyle\left\|x_{k-1}-x^{\star}_{k}\right\|\leq\frac{L_{A}}{\mu_{f}}\cdot 2(1-\rho)^{k/2-1}M+\delta_{k-1}\leq(1-\rho)^{k/2}\left(2D+\frac{3L_{A}}{\mu_{f}}M\right).

where the last inequality holds because 0<\rho\leq\frac{1}{12}. Therefore, by Proposition 1 and our choice \delta_{k}=(1-\rho)^{k/2}D, the number of inner steps T_{k} of \mathsf{AGD}(\Psi_{k},x_{k-1},\delta_{k}) can be upper bounded by

Tk\displaystyle T_{k}\leq 2LΨμΨlog(2LΨμΨxk1xkδk)\displaystyle~{}2\sqrt{\frac{L_{\Psi}}{\mu_{\Psi}}}\log\left(\frac{2L_{\Psi}}{\mu_{\Psi}}\cdot\frac{\left\|x_{k-1}-x^{\star}_{k}\right\|}{\delta_{k}}\right)
\displaystyle\leq 2LΨμΨlog(2LΨμΨ(2+3LAμfMD))\displaystyle~{}2\sqrt{\frac{L_{\Psi}}{\mu_{\Psi}}}\log\left(\frac{2L_{\Psi}}{\mu_{\Psi}}\cdot\left(2+\frac{3L_{A}}{\mu_{f}}\frac{M}{D}\right)\right)
\displaystyle\leq 8Lf+1LA2μflog(10κfκA(1+LAλ0λLfD)).\displaystyle~{}8\sqrt{\frac{L_{f}+\ell^{-1}L_{A}^{2}}{\mu_{f}}}\log\left(10\kappa_{f}\kappa_{A}\left(1+\frac{L_{A}\left\|\lambda_{0}-\lambda^{\star}\right\|}{L_{f}D}\right)\right).

where the last inequality follows from plugging in the definition of MM, LΨL_{\Psi} and μΨ=μf\mu_{\Psi}=\mu_{f}. The case k=1k=1 follows similarly.

By combining (15) and (13), we have

xTxLAμfλT1λ+δT(1ρ)T/2(D+3LAμfM).\displaystyle\left\|x_{T}-x^{\star}\right\|\leq\frac{L_{A}}{\mu_{f}}\left\|\lambda_{T-1}-\lambda^{\star}\right\|+\delta_{T}\leq(1-\rho)^{T/2}\left(D+\frac{3L_{A}}{\mu_{f}}M\right).

The desired result follows immediately. ∎

Corollary 1

Suppose that the same assumptions of Proposition 3 hold and DDD_{\star}\leq D. In order to find an approximate solution xTx_{T} satisfying SubOpt𝖲𝖢(xT)ϵ\operatorname{SubOpt}_{\mathsf{SC}}(x_{T})\leq\epsilon, the total number of gradient evaluations for Algorithm 2 is bounded by

O~(κAκflog(D/ϵ)).\tilde{O}\left(\kappa_{A}\sqrt{\kappa_{f}}\log(D/\epsilon)\right).
Proof

According to Theorem 3.1, if we let \ell=\mu_{\Phi}, then the complexity of each inner loop is \tilde{O}(\kappa_{A}\sqrt{\kappa_{f}}) and the complexity of the outer loop is \tilde{O}\left(\log(D/\epsilon)\right). Therefore, the overall complexity is \tilde{O}\left(\kappa_{A}\sqrt{\kappa_{f}}\log(D/\epsilon)\right).

3.3 Non-convex case

For non-convex problems, our algorithm is presented in Algorithm 3, which employs the inexact proximal point algorithm in the outer iterations while solving the strongly convex subproblems via Algorithm 2.

1 Input: initial point x0x_{0}, smoothness parameter LfL_{f}, condition number κA\kappa_{A}, subproblem error tolerance {δk}\{\delta_{k}\} and the maximum iteration number TT.
2 for k=1,,Tk=1,\cdots,T do
3       Apply Algorithm 2 to find
xkxk:=argminx{F(x)+Lfxxk12},x_{k}\approx x_{k}^{\star}:=\operatorname*{arg\,min}_{x}\left\{F(x)+L_{f}\|x-x_{k-1}\|^{2}\right\}, (16)
such that xkxkδk\|x_{k}-x_{k}^{\star}\|\leq\delta_{k}.
Output: {xk}k=1T\{x_{k}\}_{k=1}^{T}.
Algorithm 3 Inexact PPA for non-convex problems
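
A minimal sketch of the outer loop of Algorithm 3; solve_sc stands for an abstract solver for the strongly convex subproblem (16) in the spirit of Algorithm 2 (it is assumed to handle the h(Ax-b) part internally via the proximal oracle), only the gradient of the smooth part is passed here, and the interface is ours.

import numpy as np

def inexact_ppa_nc(x0, grad_f, Lf, Delta, T, solve_sc):
    # Algorithm 3: at step k, approximately solve (16), i.e. minimize
    # F(x) + Lf * ||x - x_{k-1}||^2, to accuracy delta_k = sqrt(Delta/Lf)/(2k).
    xs = [x0.copy()]
    for k in range(1, T + 1):
        delta_k = np.sqrt(Delta / Lf) / (2.0 * k)
        x_prev = xs[-1]
        # the smooth part f(x) + Lf||x - x_prev||^2 is 3Lf-smooth, 2Lf-strongly convex
        grad_k = lambda x, xc=x_prev: grad_f(x) + 2.0 * Lf * (x - xc)
        xs.append(solve_sc(grad_k, L=3.0 * Lf, mu=2.0 * Lf, x_init=x_prev, tol=delta_k))
    return xs[1:]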

Under a suitable sequence of {δk}\{\delta_{k}\}, we provide the convergence rate of Algorithm 3 in the following theorem.

Theorem 3.2

Suppose that f(x) is L_{f}-smooth and the condition number of A is \kappa_{A}. Assume that F(x_{1}^{\star})-\inf_{x}F(x)\leq\Delta^{\prime}, set \delta_{k}=\frac{\sqrt{\Delta^{\prime}/L_{f}}}{2k}, and let \{x_{k}\} be the iterate sequence generated by Algorithm 3. Then we have

\min_{1\leq k\leq T}\operatorname{SubOpt}_{\mathsf{NC}}(x_{k})\leq\sqrt{\frac{5L_{f}\Delta^{\prime}}{T}}.

In particular, suppose that the initial point x0domFx_{0}\in\mathrm{dom}F and F(x0)infxF(x)ΔF(x_{0})-\inf_{x}F(x)\leq\Delta for some Δ>0\Delta>0, then taking Δ=Δ\Delta^{\prime}=\Delta yields

\min_{1\leq k\leq T}\operatorname{SubOpt}_{\mathsf{NC}}(x_{k})\leq\sqrt{\frac{5L_{f}\Delta}{T}}. (17)
Proof

By the definition of xk+1x_{k+1}^{\star}, on the one hand, we have F(xk+1)+Lfxk+1xk2F(xk)+Lfxkxk2F(x_{k+1}^{\star})+L_{f}\|x_{k+1}^{\star}-x_{k}\|^{2}\leq F(x_{k}^{\star})+L_{f}\|x_{k}^{\star}-x_{k}\|^{2}, which yields

xk+1xk21Lf(F(xk)F(xk+1))+δk2.\|x_{k+1}^{\star}-x_{k}\|^{2}\leq\frac{1}{L_{f}}(F(x_{k}^{\star})-F(x_{k+1}^{\star}))+\delta_{k}^{2}. (18)

On the other hand, it holds

xk+1=𝐩𝐫𝐨𝐱1Lfh(Ab)(xk12Lff(xk+1)).x_{k+1}^{\star}=\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}-\frac{1}{2L_{f}}\nabla f(x_{k+1}^{\star})\right).

Therefore, we have

xk𝐩𝐫𝐨𝐱1Lfh(Ab)(xk12Lff(xk))\displaystyle\left\|x_{k}-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}-\frac{1}{2L_{f}}\nabla f(x_{k})\right)\right\|
\displaystyle\leq xkxk+1+xk+1𝐩𝐫𝐨𝐱1Lfh(Ab)(xk12Lff(xk))\displaystyle\ \|x_{k}-x_{k+1}^{\star}\|+\left\|x_{k+1}^{\star}-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}-\frac{1}{2L_{f}}\nabla f(x_{k})\right)\right\|
\displaystyle\leq 2xkxk+1,\displaystyle\ 2\|x_{k}-x_{k+1}^{\star}\|,

where the last inequality holds because the proximal operator is non-expansive and f()f(\cdot) is LfL_{f}-smooth. Combining with (18), we obtain

Lf2xk𝐩𝐫𝐨𝐱1Lfh(Ab)(xk12Lff(xk))2\displaystyle L_{f}^{2}\left\|x_{k}-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}-\frac{1}{2L_{f}}\nabla f(x_{k})\right)\right\|^{2}
4Lf2xkxk+124Lf(F(xk)F(xk+1))+4Lf2δk2.\displaystyle\leq 4L_{f}^{2}\|x_{k}-x_{k+1}^{\star}\|^{2}\leq 4L_{f}(F(x_{k}^{\star})-F(x_{k+1}^{\star}))+4L_{f}^{2}\delta_{k}^{2}.

Summing up the above inequality over k=1,\cdots,T and dividing by T,

1Tk=1TLf2xk𝐩𝐫𝐨𝐱1Lfh(Ab)(xkf(xk)2Lf)24Lf(F(x1)F(xT+1))+LfΔT.\displaystyle\frac{1}{T}\sum_{k=1}^{T}L_{f}^{2}\left\|x_{k}\!-\!\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}\!-\!\frac{\nabla f(x_{k})}{2L_{f}}\right)\right\|^{2}\!\leq\!\frac{4L_{f}(F(x_{1}^{\star})\!-\!F(x_{T+1}^{\star}))\!+\!L_{f}\Delta^{\prime}}{T}.

Since F(x1)F(xT+1)F(x1)infxF(x)ΔF(x_{1}^{\star})-F(x_{T+1}^{\star})\leq F(x_{1}^{\star})-\inf_{x}F(x)\leq\Delta^{\prime}, it holds that

min1kTLfxk𝐩𝐫𝐨𝐱1Lfh(Ab)(xkf(xk)2Lf)5LfΔT.\min_{1\leq k\leq T}L_{f}\left\|x_{k}-\mathbf{prox}_{\frac{1}{L_{f}}h(A\cdot-b)}\left(x_{k}-\frac{\nabla f(x_{k})}{2L_{f}}\right)\right\|\leq\sqrt{\frac{5L_{f}\Delta^{\prime}}{T}}.

Furthermore, under the assumption of x0domFx_{0}\in\mathrm{dom}F, it holds F(x1)F(x0)F(x_{1}^{\star})\leq F(x_{0}) and accordingly F(x1)infxF(x)F(x0)infxF(x)ΔF(x_{1}^{\star})-\inf_{x}F(x)\leq F(x_{0})-\inf_{x}F(x)\leq\Delta. This completes the proof. ∎

In Theorem 3.2, we present the first inequality by incorporating the definition of Δ\Delta^{\prime} specifically for the scenario where x0domFx_{0}\notin\mathrm{dom}F. This is necessary because in such cases, F(x0)infxF(x)F(x_{0})-\inf_{x}F(x) can be infinite, as exemplified by h(x)h(x) when it is an indicator function. Consequently, it becomes impossible to find a finite Δ\Delta that satisfies F(x0)infxF(x)ΔF(x_{0})-\inf_{x}F(x)\leq\Delta. By introducing the definition of Δ\Delta^{\prime}, we ensure the existence of a finite Δ\Delta^{\prime} and thus establish a well-defined result.

Corollary 2

Under the same assumption and same choice of δk\delta_{k} in Theorem 3.2, in order to find an approximate solution xTx_{T} satisfying SubOpt𝖭𝖢(xT)ϵ\operatorname{SubOpt}_{\mathsf{NC}}(x_{T})\leq\epsilon, the total number of gradient evaluations for Algorithm 3 is bounded by

O~(κALfΔϵ2).\tilde{{O}}\left(\frac{\kappa_{A}L_{f}\Delta^{\prime}}{\epsilon^{2}}\right).

Furthermore, if x0domFx_{0}\in\mathrm{dom}F, the total number is bounded by O~(κALfΔϵ2)\tilde{{O}}\left(\frac{\kappa_{A}L_{f}\Delta}{\epsilon^{2}}\right).

Proof

By Theorem 3.2, to reach the expected precision, we need O(LfΔϵ2)O\left(\frac{L_{f}\Delta^{\prime}}{\epsilon^{2}}\right) outer iterations. For each 1kT1\leq k\leq T, the function f(x)+Lfxxk12f(x)+L_{f}\|x-x_{k-1}\|^{2} is 2Lf2L_{f}-strongly-convex and 3Lf3L_{f}-smooth, and hence its condition number is O(1)O(1). Hence, the number of gradient evaluations in the kk-th inner iteration with Algorithm 2 is O~(κAlog(1/δk))\tilde{O}(\kappa_{A}\log(1/\delta_{k})) by Corollary 1. Combining the complexities of inner and outer loops, we obtain the O~(κALfΔϵ2)\tilde{{O}}\left(\frac{\kappa_{A}L_{f}\Delta^{\prime}}{\epsilon^{2}}\right) overall complexity. For the case x0domFx_{0}\in\mathrm{dom}F, the complexity can be derived similarly.∎

3.4 Convex case

For any given x0x_{0} and ϵ>0\epsilon>0, we construct the following auxiliary problem:

minxf(x)+h(Axb)+ϵ2D2xx02.\min_{x}\ f(x)+h(Ax-b)+\frac{\epsilon}{2D^{2}}\|x-x_{0}\|^{2}. (19)

The smooth part f(x)+ϵ2D2xx02f(x)+\frac{\epsilon}{2D^{2}}\|x-x_{0}\|^{2} is strongly convex and hence we can apply Algorithm 2 to solve the problem. The following corollary illustrates that the approximate solution of (19) is also an approximate solution of the original convex problem and the overall complexity is also optimal.
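
A sketch of this reduction, again treating the strongly convex solver of Algorithm 2 as a black box solve_sc with an assumed interface that handles h(Ax-b) internally; the tolerance delta is the one chosen in the proof of Corollary 3 below, and all names are illustrative.

def solve_convex_case(x0, grad_f, Lf, D, eps, delta, solve_sc):
    # Solve the perturbed problem (19): f(x) + h(Ax - b) + eps/(2 D^2) * ||x - x0||^2.
    mu = eps / D**2                                   # strong convexity of the added quadratic
    grad_pert = lambda x: grad_f(x) + mu * (x - x0)   # gradient of the perturbed smooth part
    # the perturbed smooth part is (Lf + mu)-smooth and mu-strongly convex
    return solve_sc(grad_pert, L=Lf + mu, mu=mu, x_init=x0, tol=delta)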

Corollary 3

Suppose that f(x) is convex, the condition number of A is \kappa_{A}, and \|x_{0}-x^{\star}\|\leq D. For any given 0<\rho<+\infty, Algorithm 2 can be applied to problem (19) to output an approximate solution \hat{x} satisfying \operatorname{SubOpt}_{\mathsf{C}}(\hat{x})\leq\epsilon, and the total number of gradient evaluations is bounded by

O~(κALfDϵ),\tilde{O}\left(\frac{\kappa_{A}\sqrt{L_{f}}D}{\sqrt{\epsilon}}\right),

where O~\tilde{O} also hides the logarithmic dependence on ρ\rho.

Proof

Denote the exact solution of (19) as x_{\epsilon}^{\star}. We apply Algorithm 2 to (19) and compute a point \hat{x} such that \left\|\hat{x}-x_{\epsilon}^{\star}\right\|\leq\delta, where \delta will be specified later in the proof.

By the optimality of xϵx_{\epsilon}^{\star}, it holds that

F(xϵ)+ϵ2D2xϵx02F(x)+ϵ2D2xx02.\displaystyle F(x_{\epsilon}^{\star})+\frac{\epsilon}{2D^{2}}\left\|x_{\epsilon}^{\star}-x_{0}\right\|^{2}\leq F(x^{\star})+\frac{\epsilon}{2D^{2}}\left\|x^{\star}-x_{0}\right\|^{2}.

In particular, we have F(xϵ)F(x)+ϵ/2F(x_{\epsilon}^{\star})\leq F(x^{\star})+\epsilon/2, and by the fact that F(x)F(xϵ)F(x^{\star})\leq F(x_{\epsilon}^{\star}), we know xϵx0xx0D\left\|x_{\epsilon}^{\star}-x_{0}\right\|\leq\left\|x^{\star}-x_{0}\right\|\leq D.

On the other hand, we have

f(x^)\displaystyle f(\hat{x})\leq f(xϵ)+f(xϵ),x^xϵ+Lf2x^xϵ2\displaystyle~{}f(x_{\epsilon}^{\star})+\left\langle{\nabla f(x_{\epsilon}^{\star})},{\hat{x}-x_{\epsilon}^{\star}}\right\rangle+\frac{L_{f}}{2}\left\|\hat{x}-x_{\epsilon}^{\star}\right\|^{2}
\displaystyle\leq f(xϵ)+(f(x0)+LfD)x^xϵ+Lf2x^xϵ2,\displaystyle~{}f(x_{\epsilon}^{\star})+\left(\left\|\nabla f(x_{0})\right\|+L_{f}D\right)\left\|\hat{x}-x_{\epsilon}^{\star}\right\|+\frac{L_{f}}{2}\left\|\hat{x}-x_{\epsilon}^{\star}\right\|^{2},

where the last inequality follows from f(xϵ)f(x0)Lfxϵx0LfD\left\|\nabla f(x_{\epsilon}^{\star})-\nabla f(x_{0})\right\|\leq L_{f}\left\|x_{\epsilon}^{\star}-x_{0}\right\|\leq L_{f}D. Further, since hρ()h_{\rho}(\cdot) is ρ\rho-Lipschitz continuous and hρ()h()h_{\rho}(\cdot)\leq h(\cdot), it holds

hρ(Ax^b)hρ(Axϵb)+ρAx^Axϵh(Axϵb)+ρLAx^xϵ.h_{\rho}(A\hat{x}-b)\leq h_{\rho}(Ax_{\epsilon}^{\star}-b)+\rho\left\|A\hat{x}-Ax_{\epsilon}^{\star}\right\|\leq h(Ax_{\epsilon}^{\star}-b)+\rho L_{A}\left\|\hat{x}-x_{\epsilon}^{\star}\right\|.

Denote Cρ=f(x0)+LfD+ρLAC_{\rho}=\left\|\nabla f(x_{0})\right\|+L_{f}D+\rho L_{A}. Combining the above two inequalities yields

f(x^)+hρ(Ax^b)\displaystyle f(\hat{x})+h_{\rho}(A\hat{x}-b)\leq f(xϵ)+h(Axϵb)+Cρx^xϵ+Lf2x^xϵ2\displaystyle~{}f(x_{\epsilon}^{\star})+h(Ax_{\epsilon}^{\star}-b)+C_{\rho}\left\|\hat{x}-x_{\epsilon}^{\star}\right\|+\frac{L_{f}}{2}\left\|\hat{x}-x_{\epsilon}^{\star}\right\|^{2}
\displaystyle\leq F(xϵ)+Cρδ+Lfδ22\displaystyle~{}F(x_{\epsilon}^{\star})+C_{\rho}\delta+\frac{L_{f}\delta^{2}}{2}
\displaystyle\leq F(x)+ϵ2+Cρδ+Lfδ22.\displaystyle~{}F(x^{\star})+\frac{\epsilon}{2}+C_{\rho}\delta+\frac{L_{f}\delta^{2}}{2}.

Therefore, we can set

δ=min{ϵ3Cρ,ϵ3Lf}\displaystyle\delta=\min\left\{\frac{\epsilon}{3C_{\rho}},\sqrt{\frac{\epsilon}{3L_{f}}}\right\}

to ensure that SubOpt𝖢(x^)ϵ\operatorname{SubOpt}_{\mathsf{C}}(\hat{x})\leq\epsilon. Notice that the function f(x)+ϵ2D2xx02f(x)+\frac{\epsilon}{2D^{2}}\left\|x-x_{0}\right\|^{2} is (Lf+ϵ/D2)(L_{f}+\epsilon/D^{2})-smooth and (ϵ/D2)(\epsilon/D^{2})-strongly convex. Therefore, according to Corollary 1, the required number of gradient evaluations is O~(κALfD2ϵ+1)\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}+1}\right). ∎

Note that specific forms of h_{\rho} are given in Lemma 2 and Remarks 1 and 2. In the following, we give the complexity results for these specific problems.

Corollary 4

For the conic inequality constrained problem (3) and any fixed \rho>0, in order to find an approximate solution x_{T} satisfying

|f(xT)f(x)|ϵ,𝒫𝒦(AxTb)ϵρ,\displaystyle|f(x_{T})-f(x^{\star})|\leq\epsilon,\qquad\|\mathcal{P}_{\mathcal{K}^{\circ}}(Ax_{T}-b)\|\leq\frac{\epsilon}{\rho},

the required number of gradient evaluations is \tilde{O}\left(\kappa_{A}D\sqrt{L_{f}/\epsilon}\right).

Corollary 4 implies that for conic inequality constrained convex problems (including linearly constrained convex problems), the constraint can be fulfilled to arbitrary accuracy without affecting the order of the complexity (up to log factor).

Corollary 5

When h(\cdot) is \rho-Lipschitz continuous (e.g., the norm regularized problem (4)), in order to find an approximate solution x_{T} satisfying F(x_{T})-\min_{x}F(x)\leq\epsilon, the required number of gradient evaluations is \tilde{O}\left(\kappa_{A}D\sqrt{L_{f}/\epsilon}\right).

4 Lower bounds

4.1 Problem classes and algorithm class

In this section, we aim to construct three hard instances for the strongly convex, convex, and non-convex cases, respectively. Let us first formally define the problem classes and the linear span first-order algorithm class. For simplicity of presentation, we construct the hard instances with x_{0}=0 and L_{A}=2.

Strongly convex problem class.  For positive constants Lfμf>0L_{f}\geq\mu_{f}>0, D,κA>0D,\kappa_{A}>0, the problem class SC(Lf,μf,κA,D)\mathcal{F}_{\mathrm{SC}}(L_{f},\mu_{f},\kappa_{A},D) includes problems in which f(x)f(x) is LfL_{f}-smooth and μf\mu_{f}-strongly convex, x0xD\|x_{0}-x^{\star}\|\leq D, and the condition number of AA is upper bounded by κA\kappa_{A}.

Non-convex problem class.  For positive constants Lf,κA,Δ>0L_{f},\kappa_{A},\Delta>0 and x0domFx_{0}\in\mathrm{dom}F, the problem class NC(Lf,Δ,κA)\mathcal{F}_{\mathrm{NC}}(L_{f},\Delta,\kappa_{A}) includes problems where f(x)f(x) is LfL_{f}-smooth, F(x0)F(x)ΔF(x_{0})-F(x^{\star})\leq\Delta and the condition number of AA is upper bounded by κA\kappa_{A}.

Convex problem class.  For positive constants Lf,D,κA>0L_{f},D,\kappa_{A}>0, the problem class C(Lf,κA,D)\mathcal{F}_{\mathrm{C}}(L_{f},\kappa_{A},D) includes problems in which f(x)f(x) is LfL_{f}-smooth, x0xD\|x_{0}-x^{\star}\|\leq D, and the condition number of AA is upper bounded by κA\kappa_{A}.

For the above three problem classes, we restrict our discussion to first-order linear span algorithms. The results can be extended to the general first-order algorithms without linear span structure by the orthogonal invariance trick proposed in carmon2020lower .

First-order linear span algorithms.  The iterate sequence {(xk,λk)}\{(x_{k},\lambda_{k})\} is generated such that (xk,λk)𝒮k+1x×𝒮k+1λ(x_{k},\lambda_{k})\in\mathcal{S}^{x}_{k+1}\times\mathcal{S}^{\lambda}_{k+1}. These subspaces are generated by starting with 𝒮0x=Span{x0}\mathcal{S}^{x}_{0}=\mathrm{Span}\{x_{0}\}, 𝒮0λ=Span{λ0}\mathcal{S}^{\lambda}_{0}=\mathrm{Span}\{\lambda_{0}\} and

\begin{split}\mathcal{S}^{x}_{k+1}&\mathrel{\mathop{:}}=\mathrm{Span}\left\{x_{i},\nabla f(\hat{x}_{i}),A^{\mathrm{T}}\hat{\lambda}_{i}:\forall\hat{x}_{i}\in\mathcal{S}_{i}^{x},\hat{\lambda}_{i}\in\mathcal{S}_{i}^{\lambda},0\leq i\leq k\right\},\\ \mathcal{S}^{\lambda}_{k+1}&\mathrel{\mathop{:}}=\mathrm{Span}\left\{\lambda_{i},\mathbf{prox}_{\eta_{i}h^{\star}}\left(\hat{\lambda}_{i}+\eta_{i}(A\hat{x}_{i}-b)\right):\forall\hat{x}_{i}\in\mathcal{S}_{i}^{x},\hat{\lambda}_{i}\in\mathcal{S}_{i}^{\lambda},0\leq i\leq k\right\}.\end{split}

If we assume that λ0=0\lambda_{0}=0, then for the linear equality constrained problem (2), 𝐩𝐫𝐨𝐱ηih(x)=x\mathbf{prox}_{\eta_{i}h^{\star}}(x)=x and the algorithm class degenerates into

xk+1𝒮k+1x=Span{xi,f(x^i),AT(Ax^ib),x^i𝒮ix,0ik}.x_{k+1}\in\mathcal{S}^{x}_{k+1}=\mathrm{Span}\left\{x_{i},\nabla f(\hat{x}_{i}),A^{\mathrm{T}}(A\hat{x}_{i}-b),~{}\forall\hat{x}_{i}\in\mathcal{S}_{i}^{x},0\leq i\leq k\right\}.

We can further assume x0=0x_{0}=0 without loss of generality, otherwise, we can consider the shifted problem minxF(xx0)\min_{x}F(x-x_{0}).

Note that for a first-order linear span algorithm, it is not necessary to use the current gradient in each iteration. Instead, it can use any combination of points from the historical search space. This makes the algorithm class general enough to cover diverse iteration schemes. To give some specific examples, we present the following single-loop and double-loop algorithms covered by the considered algorithm class.

Example 1 (Single-loop algorithms)

Consider problem (2) with h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot). The Chambolle-Pock method chambolle2011first , the OGDA method mokhtari2020convergence and the linearized ALM xu2021first update the iterates by the following rules

{xk+1=xkη1(f(xk)+ATλk)λk+1=λk+η2(2Axk+1Axkb)\begin{split}\left\{\begin{array}[]{l}x_{k+1}=x_{k}-\eta_{1}\left(\nabla f(x_{k})+A^{\mathrm{T}}\lambda_{k}\right)\\ \lambda_{k+1}=\lambda_{k}+\eta_{2}(2Ax_{k+1}-Ax_{k}-b)\end{array}\right.\end{split} (Chambolle-Pock)
{xk+1=xkη1(2f(xk)f(xk1)+AT(2λkλk1))λk+1=λk+η2(2AxkAxk1b)\begin{split}\left\{\begin{array}[]{l}x_{k+1}=x_{k}-\eta_{1}\left(2\nabla f(x_{k})-\nabla f(x_{k-1})+A^{\mathrm{T}}(2\lambda_{k}-\lambda_{k-1})\right)\\ \lambda_{k+1}=\lambda_{k}+\eta_{2}(2Ax_{k}-Ax_{k-1}-b)\end{array}\right.\end{split} (OGDA)
{xk+1=xkη1(f(xk)+ATλk+ρAT(Axkb))λk+1=λk+η2(Axk+1b)\begin{split}\left\{\begin{array}[]{l}x_{k+1}=x_{k}-\eta_{1}\left(\nabla f(x_{k})+A^{\mathrm{T}}\lambda_{k}+\rho A^{\mathrm{T}}(Ax_{k}-b)\right)\\ \lambda_{k+1}=\lambda_{k}+\eta_{2}(Ax_{k+1}-b)\end{array}\right.\end{split} (Linearized ALM)

where ρ>0\rho>0 is a penalty factor.

For problem class SC(Lf,μf,κA)\mathcal{F}_{SC}(L_{f},\mu_{f},\kappa_{A}), a unified analysis on the above three methods was provided in zhu2022unified , and an O((κf+κA2)log(1ϵ))O\left((\kappa_{f}+\kappa_{A}^{2})\log\left(\frac{1}{\epsilon}\right)\right) complexity is achieved.
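To make the single-loop updates above concrete, the following minimal Python sketch runs the linearized ALM iteration on a toy instance of (2) with a strongly convex quadratic objective. The random data, the step sizes and the penalty parameter are illustrative choices of this sketch only, not the tuned parameters analyzed in zhu2022unified ; the Chambolle-Pock and OGDA iterations fit the same template with the two update lines changed accordingly.

```python
import numpy as np

# Toy instance of problem (2): f(x) = 0.5*||x - c||^2 (1-smooth, 1-strongly convex), Ax = b.
rng = np.random.default_rng(0)
n, m = 20, 5
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)                  # feasible right-hand side
c = rng.standard_normal(n)
grad_f = lambda x: x - c

def linearized_alm(T=5000, rho=1.0):
    eta1 = 1.0 / (1.0 + rho * np.linalg.norm(A, 2) ** 2)   # primal step size
    eta2 = rho                                              # dual step size
    x, lam = np.zeros(n), np.zeros(m)
    for _ in range(T):
        # primal step: one gradient step on the augmented Lagrangian in x
        x = x - eta1 * (grad_f(x) + A.T @ lam + rho * A.T @ (A @ x - b))
        # dual ascent step on the updated constraint residual
        lam = lam + eta2 * (A @ x - b)
    return x, lam

x, lam = linearized_alm()
print("constraint violation:", np.linalg.norm(A @ x - b))
print("stationarity residual:", np.linalg.norm(grad_f(x) + A.T @ lam))
```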

Example 2 (Double-loop algorithm xu2021iteration )

Consider problem (2) with h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot). Let ρ(x,λ)=f(x)+λT(Axb)+ρ2Axb2\mathcal{L}_{\rho}(x,\lambda)=f(x)+\lambda^{\mathrm{T}}(Ax-b)+\frac{\rho}{2}\|Ax-b\|^{2} be the augmented Lagrangian function. ALM generates iterates by

{xk+1argminxρ(x,λk)λk+1=λk+ρ(Axk+1b)\left\{\begin{array}[]{l}x_{k+1}\approx\operatorname*{arg\,min}_{x}\mathcal{L}_{\rho}(x,\lambda_{k})\\ \lambda_{k+1}=\lambda_{k}+\rho(Ax_{k+1}-b)\end{array}\right.

where the subproblem is solved by an inner loop of Nesterov’s AGD method. An O(ϵ1)O(\epsilon^{-1}) complexity for convex problems and an O(ϵ12)O(\epsilon^{-\frac{1}{2}}) complexity for strongly convex problems are derived in xu2021iteration .

Remark 3

It can be checked that all three algorithms proposed in the upper bound section belong to the defined first order linear span algorithm class for general hh.

4.2 The construction of hard instance

In this section, we construct hard instances for first-order linear span algorithms. Specifically, we consider the linear equality constrained problem (2) with h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot). For positive integers NN and dd, we define the following problem

minx2Ndf0(x):=G(x[1],x[N+1])++G(x[N],x[2N]),s.t.x[1]=x[2]==x[2N],\displaystyle\begin{aligned} \min_{x\in\mathbb{R}^{2Nd}}&~{}f_{0}(x)\mathrel{\mathop{:}}=G(x[1],x[N+1])+\cdots+G(x[N],x[2N]),\\ \,\textrm{s.t.}\,&~{}x[1]=x[2]=\cdots=x[2N],\end{aligned} (20)

where x[1],,x[2N]dx[1],\cdots,x[2N]\in\mathbb{R}^{d} and x2Ndx\in\mathbb{R}^{2Nd} is the vector that stacks x[i]x[i] together in order, and the component function G(u,v):d×dG(u,v):\mathbb{R}^{d}\times\mathbb{R}^{d}\mapsto\mathbb{R} is a smooth function to be determined later. To ensure that f0(x)f_{0}(x) satisfies the assumptions of different problem classes, we will construct various formulations of G(u,v)G(u,v) in the strongly convex, convex, and non-convex cases, respectively. Additionally, we require G(u,v)G(u,v) to satisfy the following assumption.

Assumption 4.1

For any i0i\geq 0, it holds

  1.

    If supp{u}[i+1],supp{v}[i]\operatorname{supp}\left\{u\right\}\subset[i+1],\operatorname{supp}\left\{v\right\}\subset[i], then supp{uG(u,v)}[i+1]\operatorname{supp}\left\{\nabla_{u}G(u,v)\right\}\subset[i+1] and supp{vG(u,v)}[i]\operatorname{supp}\left\{\nabla_{v}G(u,v)\right\}\subset[i].

  2.

    If supp{u}[i],supp{v}[i]\operatorname{supp}\left\{u\right\}\subset[i],\operatorname{supp}\left\{v\right\}\subset[i], then supp{uG(u,v)}[i+1]\operatorname{supp}\left\{\nabla_{u}G(u,v)\right\}\subset[i+1] and supp{vG(u,v)}[i]\operatorname{supp}\left\{\nabla_{v}G(u,v)\right\}\subset[i].

The constraint in (20) can be rewritten as Ax=0Ax=0 with

A=[IdIdIdIdIdId](2N1)d×2Nd.A=\left[\begin{array}[]{ccccc}I_{d}&-I_{d}&&&\\ &\ddots&\ddots&&\\ &&I_{d}&-I_{d}&\\ &&&I_{d}&-I_{d}\end{array}\right]\in\mathbb{R}^{(2N-1)d\times 2Nd}. (21)

Hence AAT(2N1)d×(2N1)dAA^{\mathrm{T}}\in\mathbb{R}^{(2N-1)d\times(2N-1)d} and ATA2Nd×2NdA^{\mathrm{T}}A\in\mathbb{R}^{2Nd\times 2Nd} can be computed as

AAT=[2IdIdId2IdIdId2IdIdId2Id],ATA=[IdIdId2IdIdId2IdIdIdId].AA^{\mathrm{T}}=\left[\begin{array}[]{ccccc}2I_{d}&-I_{d}&&&\\ -I_{d}&2I_{d}&-I_{d}&&\\ &\ddots&\ddots&\ddots&\\ &&-I_{d}&2I_{d}&-I_{d}\\ &&&-I_{d}&2I_{d}\end{array}\right],~{}A^{\mathrm{T}}A=\left[\begin{array}[]{ccccc}I_{d}&-I_{d}&&&\\ -I_{d}&2I_{d}&-I_{d}&&\\ &\ddots&\ddots&\ddots&\\ &&-I_{d}&2I_{d}&-I_{d}\\ &&&-I_{d}&I_{d}\end{array}\right]. (22)
Lemma 4

For matrix AA defined in (21), its condition number satisfies κA2N21\kappa_{A}\!\leq\!\sqrt{2N^{2}\!-\!1}.

Proof

Note that T=AATT=AA^{\mathrm{T}} is a block tridiagonal Toeplitz matrix and its eigenvalues are 2+2cos(πi2N),1i2N12+2\cos\left(\frac{\pi i}{2N}\right),1\leq i\leq 2N-1 (see noschese2013tridiagonal ). Accordingly, the condition number of TT satisfies

κT=2+2cos(π2N)2+2cos(π(2N1)2N)=1+cos(π2N)1cos(π2N)=1+21cos(π2N)12N21,\kappa_{T}=\frac{2+2\cos\left(\frac{\pi}{2N}\right)}{2+2\cos\left(\frac{\pi(2N-1)}{2N}\right)}=\frac{1+\cos\left(\frac{\pi}{2N}\right)}{1-\cos\left(\frac{\pi}{2N}\right)}=1+\frac{2}{\frac{1}{\cos\left(\frac{\pi}{2N}\right)}-1}\leq 2N^{2}-1,

where the last inequality is due to cos(πx2)1x2\cos\left(\frac{\pi x}{2}\right)\leq 1-x^{2}, x[0,1]\forall x\in\left[0,1\right]. Consequently, κA2N21\kappa_{A}\leq\sqrt{2N^{2}-1}. ∎
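The bound of Lemma 4 is easy to check numerically. The following sketch (assuming numpy is available) builds the matrix AA from (21) for a few values of NN and dd and compares its condition number with \sqrt{2N^{2}-1}.

```python
import numpy as np

def build_A(N, d):
    """Matrix A from (21): size (2N-1)d x 2Nd with blocks I_d and -I_d."""
    A = np.zeros(((2 * N - 1) * d, 2 * N * d))
    I = np.eye(d)
    for i in range(2 * N - 1):
        A[i * d:(i + 1) * d, i * d:(i + 1) * d] = I
        A[i * d:(i + 1) * d, (i + 1) * d:(i + 2) * d] = -I
    return A

for N, d in [(2, 3), (4, 2), (8, 5)]:
    s = np.linalg.svd(build_A(N, d), compute_uv=False)
    print(N, d, round(s.max() / s.min(), 4), "<=", round(np.sqrt(2 * N**2 - 1), 4))
```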

Next, we demonstrate the propagation of non-zero entries in this example. For a first-order linear span algorithm with b=0b=0, we have

𝒮k+1span{𝒮k{f0(x^k),ATAx^k}}.\mathcal{S}_{k+1}\subset\mathrm{span}\left\{\mathcal{S}_{k}\cup\left\{\nabla f_{0}(\hat{x}_{k}),A^{\mathrm{T}}A\hat{x}_{k}\right\}\right\}.

This implies that new non-zero entries are introduced either through G(x[i],x[N+i])\nabla G(x[i],x[N+i]), or through the action of ATAA^{\mathrm{T}}A on xx. As ATAA^{\mathrm{T}}A is a block tridiagonal matrix, each action of ATAA^{\mathrm{T}}A enables entries in x[i]x[i] to “communicate” with their neighboring vectors, thereby propagating the non-zero entries.

Figure 1 illustrates how non-zero entries propagate. Assume that the initial point is (x[i])j=0(x[i])_{j}=0 for all ii and jj. In the first iteration, we use Assumption 4.1 to observe that supp{uG(x[j],x[N+j])}[1],1jN\operatorname{supp}\left\{\nabla_{u}G(x[j],x[N+j])\right\}\subset[1],1\leq j\leq N and supp{vG(x[j],x[N+j])}=\operatorname{supp}\left\{\nabla_{v}G(x[j],x[N+j])\right\}=\emptyset, so it is only possible to have (x[1:N])10(x[1:N])_{1}\neq 0. In the second iteration, G(x[i],x[N+i])\nabla G(x[i],x[N+i]) does not introduce any new non-zero entries, but the action of ATAA^{\mathrm{T}}A on xx causes (x[N+1])1(x[N+1])_{1} to receive a non-zero entry from (x[N])1(x[N])_{1}. In the third iteration, we have uG(x[1],x[N+1])[2]\nabla_{u}G(x[1],x[N+1])\subset[2], which allows (x[1])2(x[1])_{2} to become non-zero. Additionally, ATAA^{\mathrm{T}}A propagates the non-zero entry in (x[N+1])1(x[N+1])_{1} to (x[N+2])1(x[N+2])_{1}. By repeating the above propagation mechanism, we can see that by the (N+1)(N+1)th iteration, both (x[1:2N])1(x[1:2N])_{1} and (x[1:N1])2(x[1:N-1])_{2} become nonzero. In the (N+2)(N+2)th iteration, (x[N])2(x[N])_{2} becomes nonzero through uG(x[N],x[2N])[2]\nabla_{u}G(x[N],x[2N])\subset[2]. We can consider iterations 22 to N+2N+2 (which consist of N+1N+1 iterations) as one complete round of iterations. By repeating this process, each round of iteration can convert up to 2N2N elements to nonzero. After i2i-2 rounds of iteration, specifically at the ((i1)(N+1)+1)((i-1)(N+1)+1)th iteration, (x[1:2N])1:i1(x[1:2N])_{1:i-1} and (x[1:N])i(x[1:N])_{i} become possibly nonzero.

Figure 1: Propagation of nonzero entries. In this figure, the pair (i,j)(i,j) represents the entry (x[i])j(x[i])_{j}. The propagation is indicated by blue arrows when passing through f0(x)\nabla f_{0}(x), and by orange arrows when passing through ATAxA^{\mathrm{T}}Ax. The iteration number is placed above each arrow.

Based on the above procedure, it is natural to obtain the following Lemma.

Lemma 5

For k=(i1)(N+1)+jk=(i-1)(N+1)+j with 1id1,1jN+11\leq i\leq d-1,1\leq j\leq N+1, we have

supp{xk}{(1:2N,1:i1)(1:N+j1,i)(1:j2,i+1)},\displaystyle\operatorname{supp}\left\{x_{k}\right\}\subset\left\{(1:2N,1:i-1)\cup(1:N+j-1,i)\cup(1:j-2,i+1)\right\},

where the pair (i,j)(i,j) represents the entry (x[i])j(x[i])_{j}. Therefore, for any k>0k>0, let K=k2N+1+1K=\left\lfloor\frac{k-2}{N+1}\right\rfloor+1, then (x[i])j=0(x[i])_{j}=0 for any i,ji,j satisfying N+1i2N,K+1jdN+1\leq i\leq 2N,K+1\leq j\leq d.

It is well known that the complexity lower bound of the strongly convex case for an unconstrained smooth problem is Ω(κflog(1/ϵ))\Omega(\sqrt{\kappa_{f}}\log(1/\epsilon)) nesterov2018lectures , the lower bound of the convex case is Ω(LfD/ϵ)\Omega\left(\sqrt{L_{f}}D/\sqrt{\epsilon}\right) nesterov2018lectures and the lower bound of non-convex case is Ω(LfΔ/ϵ2)\Omega(L_{f}\Delta/\epsilon^{2}) carmon2020lower . These papers construct hard instances with zero-chaining property, where only one possible non-zero entry is added to the decision variable at each iteration (see Chapter 2 of nesterov2018lectures ). In contrast, Lemma 5 indicates that in order to add a non-zero entry to each x[i]x[i], it is necessary to take at least N+1N+1 iterations, that is, Ω(κA)\Omega(\kappa_{A}) iterations by Lemma 4. Therefore, our complexity lower bounds need to be multiplied by an additional factor of κA\kappa_{A} on top of the lower bounds of unconstrained problems. This intuitively yields our results listed in Table 1.
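The propagation mechanism behind Lemma 5 can also be simulated symbolically. The sketch below tracks an over-approximation of supp{x[i]} for each block when, in every iteration, one gradient of f0 and one multiplication by AᵀA are applied to the whole current search space. The support rules are specialized to the chain-like component (23) (they are an assumption of this sketch, not part of the general algorithm class) and reproduce the pattern of Figure 1.

```python
def propagate(N, d, iters):
    """Over-approximate supp(x[i]) (sets of coordinate indices) per block."""
    supp = [set() for _ in range(2 * N)]            # blocks x[1..2N], 0-indexed
    for _ in range(iters):
        grad_u, grad_v = [], []
        for i in range(N):                          # pair (u, v) = (x[i+1], x[N+i+1])
            u, v = supp[i], supp[N + i]
            # for the chain (23): supp(grad_u G) is contained in {1} | supp(u) | (supp(v)+1)
            grad_u.append({1} | u | {j + 1 for j in v if j + 1 <= d})
            # supp(grad_v G) is contained in supp(v) | (supp(u)-1)
            grad_v.append(v | {j - 1 for j in u if j - 1 >= 1})
        # A^T A is block tridiagonal: block i only mixes with its neighbors
        ata = [supp[max(i - 1, 0)] | supp[i] | supp[min(i + 1, 2 * N - 1)]
               for i in range(2 * N)]
        for i in range(N):
            supp[i] |= grad_u[i] | ata[i]
            supp[N + i] |= grad_v[i] | ata[N + i]
    return supp

N, d = 4, 6
for k in (1, 2, N + 1, N + 2, 2 * (N + 1)):
    deepest = [max(b) if b else 0 for b in propagate(N, d, k)]
    print(k, deepest)    # deepest possibly-nonzero coordinate in each block after k iterations
```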

4.3 Strongly convex case

Let us construct our hard problem instance based on formulation (20). For any given positive parameters Lfμf>0,α>0L_{f}\geq\mu_{f}>0,\alpha>0 and a positive integer dd, we define the function G(,):d×dG(\cdot,\cdot):\mathbb{R}^{d}\times\mathbb{R}^{d}\mapsto\mathbb{R} as

G(u,v)=Lfμf4G0(u,v)+μf2(u2+v2),G(u,v)=\frac{L_{f}-\mu_{f}}{4}G_{0}(u,v)+\frac{\mu_{f}}{2}\left(\|u\|^{2}+\|v\|^{2}\right), (23)

where

G0(u,v):=(αu1)2+(v1u2)2+(v2u3)2++(vd1ud)2.G_{0}(u,v)\mathrel{\mathop{:}}=\left(\alpha-u_{1}\right)^{2}+\left(v_{1}-u_{2}\right)^{2}+\left(v_{2}-u_{3}\right)^{2}+\cdots+\left(v_{d-1}-u_{d}\right)^{2}.

The construction is based on Nesterov’s well-known “chain-like” quadratic function nesterov2018lectures . In the following, we give some properties of the constructed problem instance.

Lemma 6

The problem defined in (20) has the following properties.

  1.

    f0(x)f_{0}(x) is LfL_{f}-smooth and μf\mu_{f}-strongly convex.

  2.

    Denote q=κf1κf+1q=\frac{\sqrt{\kappa_{f}}-1}{\sqrt{\kappa_{f}}+1}. Then the optimal solution xx^{\star} of (20) is given by

    x[1]=x[2]==x[2N]=𝕩,x^{\star}[1]=x^{\star}[2]=\cdots=x^{\star}[2N]=\mathbbm{x}^{\star}, (24)

    where 𝕩d\mathbbm{x}^{\star}\in\mathbb{R}^{d} is given by

    𝕩i=αqi+q2d+1i1+q2d+1,i=1,,d.\mathbbm{x}^{\star}_{i}=\alpha\cdot\frac{q^{i}+q^{2d+1-i}}{1+q^{2d+1}},\qquad i=1,\cdots,d. (25)

    Moreover, for any K0K\geq 0, we have i=K+1d(𝕩i)2q2KdKd𝕩2\sum_{i=K+1}^{d}(\mathbbm{x}^{\star}_{i})^{2}\geq q^{2K}\cdot\frac{d-K}{d}\|\mathbbm{x}^{\star}\|^{2}.

Proof

Part 1: Let us fix the vectors u,v,ω,νdu,v,\omega,\nu\in\mathbb{R}^{d} with (ω,ν)=1\|(\omega,\nu)\|=1, and define h(θ):=G0(u+θω,v+θν)h(\theta)\mathrel{\mathop{:}}=G_{0}(u+\theta\omega,v+\theta\nu). If we take v0=0v_{0}=0 and ν0=0\nu_{0}=0, it holds

h′′(0)=\displaystyle h^{\prime\prime}(0)= i=1d2G0(u,v)ui2ωi2+i=0d12G0(u,v)vi2νi2+2i=1d2G0(u,v)uivi1ωiνi1\displaystyle\sum_{i=1}^{d}\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}^{2}}\omega_{i}^{2}+\sum_{i=0}^{d-1}\frac{\partial^{2}G_{0}(u,v)}{\partial v_{i}^{2}}\nu_{i}^{2}+2\sum_{i=1}^{d}\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}\partial v_{i-1}}\omega_{i}\nu_{i-1}
=\displaystyle= 2i=1dωi2+2i=0d1νi24i=1dωiνi1.\displaystyle 2\sum_{i=1}^{d}\omega_{i}^{2}+2\sum_{i=0}^{d-1}\nu_{i}^{2}-4\sum_{i=1}^{d}\omega_{i}\nu_{i-1}.

On the one hand, due to the Cauchy–Schwarz inequality, we have h′′(0)0h^{\prime\prime}(0)\geq 0. On the other hand,

h′′(0)2i=1dωi2+2i=0dνi2+2i=1d(ωi2+νi12)4,h^{\prime\prime}(0)\leq 2\sum_{i=1}^{d}\omega_{i}^{2}+2\sum_{i=0}^{d}\nu_{i}^{2}+2\sum_{i=1}^{d}(\omega_{i}^{2}+\nu_{i-1}^{2})\leq 4,

where the last inequality follows from (ω,ν)=1\|(\omega,\nu)\|=1. Therefore, G0(u,v)G_{0}(u,v) is convex and 44-smooth, hence G(u,v)G(u,v) is μf\mu_{f}-strongly convex and LfL_{f}-smooth, which implies f0(x)f_{0}(x) is also μf\mu_{f}-strongly convex and LfL_{f}-smooth.

Part 2: For any xx satisfying the constraint in (20), we have x=(v,v,,v)x=(v,v,\cdots,v) and f0(x)=NG(v,v)f_{0}(x)=NG(v,v). Thus, we only need to verify that 𝕩\mathbbm{x}^{\star} is the (unique) minimum point of the function G(v,v)G(v,v). By the optimality condition vG(v,v)=0\nabla_{v}G(v,v)=0, we have

\left\{\begin{array}[]{l}(2+\beta)v_{1}^{\star}-v_{2}^{\star}=\alpha,\\ -v_{1}^{\star}+(2+\beta)v_{2}^{\star}-v_{3}^{\star}=0,\\ \qquad\qquad\vdots\\ -v_{d-2}^{\star}+(2+\beta)v_{d-1}^{\star}-v_{d}^{\star}=0,\\ -v_{d-1}^{\star}+(1+\beta)v_{d}^{\star}=0,\end{array}\right.

where we denote β:=4μfLfμf\beta\mathrel{\mathop{:}}=\frac{4\mu_{f}}{L_{f}-\mu_{f}}. Note that q=12((2+β)(2+β)24)q=\frac{1}{2}\left((2+\beta)-\sqrt{(2+\beta)^{2}-4}\right) is the smallest root of the quadratic equation λ2(2+β)λ+1=0\lambda^{2}-(2+\beta)\lambda+1=0. By a direct calculation, we can check that 𝕩\mathbbm{x}^{\star} satisfies the dd equations, and hence 𝕩\mathbbm{x}^{\star} is the minimum of the function G(v,v)G(v,v). Lastly, it holds

i=K+1d(𝕩i)2=\displaystyle\sum_{i=K+1}^{d}(\mathbbm{x}_{i}^{\star})^{2}= α2i=K+1dq2i+q4d+22i+2q2d+1(1+q2d+1)2\displaystyle\alpha^{2}\sum_{i=K+1}^{d}\frac{q^{2i}+q^{4d+2-2i}+2q^{2d+1}}{(1+q^{2d+1})^{2}}
=(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{=}} α2(1+q2d+1)2(2q2d+1(dK)+i=K+12dKq2i)\displaystyle\frac{\alpha^{2}}{(1+q^{2d+1})^{2}}\left(2q^{2d+1}(d-K)+\sum_{i=K+1}^{2d-K}q^{2i}\right)
(ii)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\geq}} α2(1+q2d+1)2(2q2d+1(dK)+2d2K2di=12dq2i+2K)\displaystyle\frac{\alpha^{2}}{(1+q^{2d+1})^{2}}\left(2q^{2d+1}(d-K)+\frac{2d-2K}{2d}\sum_{i=1}^{2d}q^{2i+2K}\right)
\displaystyle\geq α2(dK)q2Kd(1+q2d+1)2(2dq2d+1+i=12dq2i)\displaystyle\frac{\alpha^{2}(d-K)q^{2K}}{d(1+q^{2d+1})^{2}}\left(2dq^{2d+1}+\sum_{i=1}^{2d}q^{2i}\right)
=\displaystyle= q2KdKd𝕩2,\displaystyle q^{2K}\cdot\frac{d-K}{d}\|\mathbbm{x}^{\star}\|^{2},

where (i)(i) follows from i=K+1d(q2i+q4d+22i)=i=K+1dq2i+i=d+12dKq2i=i=K+12dKq2i\sum_{i=K+1}^{d}(q^{2i}+q^{4d+2-2i})=\sum_{i=K+1}^{d}q^{2i}+\sum_{i=d+1}^{2d-K}q^{2i}=\sum_{i=K+1}^{2d-K}q^{2i}, (ii)(ii) holds because q<1q<1 and 12d2Ki=K+12dKq2i12di=K+12d+Kq2i=12di=12dq2i+2K\frac{1}{2d-2K}\sum_{i=K+1}^{2d-K}q^{2i}\geq\frac{1}{2d}\sum_{i=K+1}^{2d+K}q^{2i}=\frac{1}{2d}\sum_{i=1}^{2d}q^{2i+2K}.∎
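As a numerical sanity check of the closed form (25), the sketch below solves the tridiagonal optimality system from the proof directly and compares the result with the formula for 𝕩⋆; the values of Lf, μf, α and d are arbitrary test choices of this sketch.

```python
import numpy as np

L_f, mu_f, alpha, d = 10.0, 1.0, 1.3, 8
beta = 4 * mu_f / (L_f - mu_f)

# optimality system of v -> G(v, v): tridiagonal with (2+beta) on the diagonal
M = np.zeros((d, d))
rhs = np.zeros(d)
for i in range(d):
    M[i, i] = 2 + beta
    if i > 0:
        M[i, i - 1] = -1
    if i < d - 1:
        M[i, i + 1] = -1
M[d - 1, d - 1] = 1 + beta          # last equation: -v_{d-1} + (1+beta) v_d = 0
rhs[0] = alpha
v_star = np.linalg.solve(M, rhs)

kappa_f = L_f / mu_f
q = (np.sqrt(kappa_f) - 1) / (np.sqrt(kappa_f) + 1)
i = np.arange(1, d + 1)
x_star = alpha * (q**i + q**(2 * d + 1 - i)) / (1 + q**(2 * d + 1))   # formula (25)
print(np.max(np.abs(v_star - x_star)))   # expect agreement up to machine precision
```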

Now we are ready to give our lower bound result for the strongly convex case.

Theorem 4.2

Let parameters Lf>μf>0,κA1L_{f}>\mu_{f}>0,\kappa_{A}\geq 1 be given. For any integer kκA+12k\geq\left\lfloor\frac{\kappa_{A}+1}{2}\right\rfloor, there exists an instance in SC(Lf,μf,κA,D)\mathcal{F}_{\mathrm{SC}}(L_{f},\mu_{f},\kappa_{A},D) of form (20), with component function GG defined in (23), N=κA+12,K=k2N+1+1,d=2KN=\left\lfloor\frac{\kappa_{A}+1}{2}\right\rfloor,K=\left\lfloor\frac{k-2}{N+1}\right\rfloor+1,d=2K, and α\alpha suitably chosen. For this problem, the iterates generated by any first-order algorithm in 𝒜\mathcal{A} satisfy

SubOpt𝖲𝖢(xk)12(κf1κf+1)2kNx0x.\operatorname{SubOpt}_{\mathsf{SC}}(x_{k})\geq\frac{1}{2}\left(\frac{\sqrt{\kappa_{f}}-1}{\sqrt{\kappa_{f}}+1}\right)^{\frac{2k}{N}}\|x_{0}-x^{\star}\|.
Proof

By Part 1 of Lemma 6, we know f0(x)f_{0}(x) is LfL_{f}-smooth and μf\mu_{f}-strongly convex. The condition number of AA is not greater than 2N21(κA+1)221κA\sqrt{2N^{2}-1}\leq\sqrt{\frac{(\kappa_{A}+1)^{2}}{2}-1}\leq\kappa_{A}, and α\alpha can be suitably chosen so that x0x=D\|x_{0}-x^{\star}\|=D. Thus the instance we construct belongs to the problem class SC(Lf,μf,κA,D)\mathcal{F}_{\mathrm{SC}}(L_{f},\mu_{f},\kappa_{A},D). According to Lemma 5 and our choice of KK, we have (xk[j])s=0(x_{k}[j])_{s}=0 for any N+1j2NN+1\leq j\leq 2N and K+1sdK+1\leq s\leq d. Therefore,

xkx2N×s=K+1d(𝕩s)2N×q2KdKd𝕩2,\left\|x_{k}-x^{\star}\right\|^{2}\geq N\times\sum_{s=K+1}^{d}\left(\mathbbm{x}_{s}^{\star}\right)^{2}\geq N\times q^{2K}\cdot\frac{d-K}{d}\left\|\mathbbm{x}^{\star}\right\|^{2},

where the last inequality comes from Part 2 of Lemma 6. Notice that x0x2=2N𝕩2\left\|x_{0}-x^{\star}\right\|^{2}=2N\left\|\mathbbm{x}^{\star}\right\|^{2} and K=d2K=\frac{d}{2}. Substituting them into the above inequality yields

xkx2q2K4x0x2.\left\|x_{k}-x^{\star}\right\|^{2}\geq\frac{q^{2K}}{4}\left\|x_{0}-x^{\star}\right\|^{2}. (26)

Plugging in the definition q=κf1κf+1q=\frac{\sqrt{\kappa_{f}}-1}{\sqrt{\kappa_{f}}+1} and the fact that K2kNK\leq\frac{2k}{N} completes the proof. ∎

Corollary 6

For any first-order algorithm in 𝒜\mathcal{A}, parameters κf2\kappa_{f}\geq 2 and 0<ϵD200<\epsilon\leq\frac{D}{20}, there exists a problem instance in SC(Lf,μf,κA,D)\mathcal{F}_{\mathrm{SC}}(L_{f},\mu_{f},\kappa_{A},D) such that at least

Ω(κAκflog(Dϵ))\Omega\left(\kappa_{A}\sqrt{\kappa_{f}}\log\left(\frac{D}{\epsilon}\right)\right)

iterations are required in order to find an iterate xkx_{k} satisfying SubOpt𝖲𝖢(xk)ϵ\operatorname{SubOpt}_{\mathsf{SC}}(x_{k})\leq\epsilon.

4.4 Non-convex case

For the hard instance we present in (20), the definition of the suboptimality measure becomes

SubOpt𝖭𝖢(x):=Lfx𝒫𝒳(x12Lff0(x)),\operatorname{SubOpt}_{\mathsf{NC}}(x)\mathrel{\mathop{:}}=L_{f}\left\|x-\mathcal{P}_{\mathcal{X}}\left(x-\frac{1}{2L_{f}}\nabla f_{0}(x)\right)\right\|,

where 𝒳\mathcal{X} refers to the feasible region of problem (20), i.e., 𝒳:={x2Ndx[1]=x[2]==x[2N]}\mathcal{X}\mathrel{\mathop{:}}=\{x\in\mathbb{R}^{2Nd}\mid x[1]=x[2]=\cdots=x[2N]\}. For given positive integer dd, we define the function G0(,):d×dG_{0}(\cdot,\cdot):\mathbb{R}^{d}\times\mathbb{R}^{d}\mapsto\mathbb{R} as

G0(u,v)=Ψ(1)Φ(u1)+i=2d[Ψ(vi1)Φ(ui)Ψ(vi1)Φ(ui)],G_{0}(u,v)=-\Psi(1)\Phi(u_{1})+\sum_{i=2}^{d}\left[\Psi(-v_{i-1})\Phi(-u_{i})-\Psi(v_{i-1})\Phi(u_{i})\right], (27)

where function Φ(x)\Phi(x) and Ψ(x)\Psi(x) are defined as

Ψ(x):={0x1/2exp(11(2x1)2)x>1/2andΦ(x)=exe12t2dt.\Psi(x)\mathrel{\mathop{:}}=\begin{cases}0&x\leq 1/2\\ \exp\left(1-\frac{1}{(2x-1)^{2}}\right)&x>1/2\end{cases}\quad\text{and}\quad\Phi(x)=\sqrt{e}\int_{-\infty}^{x}e^{-\frac{1}{2}t^{2}}\mathrm{d}t.

The function G(,)G(\cdot,\cdot) in formulation (20) is a scaled version of G0(,)G_{0}(\cdot,\cdot) and its formal definition will be given later. Let us discuss G0(,)G_{0}(\cdot,\cdot) first. Define g0(v):=G0(v,v)g_{0}(v)\mathrel{\mathop{:}}=G_{0}(v,v) and one can observe g0(v)g_{0}(v) coincides with the hard instance constructed in carmon2020lower . We give some useful properties of G0(u,v)G_{0}(u,v).

Lemma 7

The function G0(u,v)G_{0}(u,v) has the following properties.

  1.

    If vd=0v_{d}=0, then g0(v)1\|\nabla g_{0}(v)\|\geq 1.

  2.

    G0(0,0)infu,vG0(u,v)12dG_{0}(0,0)-\inf_{u,v}G_{0}(u,v)\leq 12d.

  3.

    G0(u,v)G_{0}(u,v) is l0l_{0}-smooth with l0l_{0} being a universal constant that is independent of the problem dimension and other constants.

Proof

Part 1 comes from carmon2020lower .

Part 2: Note that 0Ψ(x)e0\leq\Psi(x)\leq e and 0Φ(x)2πe0\leq\Phi(x)\leq\sqrt{2\pi e}. It holds that

G0(u,v)Ψ(1)Φ(u1)i=2dΨ(vi1)Φ(ui)de2πe12d.G_{0}(u,v)\geq-\Psi(1)\Phi(u_{1})-\sum_{i=2}^{d}\Psi(v_{i-1})\Phi(u_{i})\geq-de\sqrt{2\pi e}\geq-12d.

Combining the fact that G0(0,0)0G_{0}(0,0)\leq 0 yields the property.

To prove Part 3, fix u,v,p,qdu,v,p,q\in\mathbb{R}^{d} with (p,q)=1\|(p,q)\|=1 and define the function h():h(\cdot):\mathbb{R}\mapsto\mathbb{R} as the restriction of G0(u,v)G_{0}(u,v) along the direction (p,q)(p,q), i.e., h(θ):=G0(u+θp,v+θq)h(\theta)\mathrel{\mathop{:}}=G_{0}(u+\theta p,v+\theta q). Taking v0=1v_{0}=1 and q0=0q_{0}=0, we have

h′′(0)=i=1d2G0(u,v)ui2pi2+i=1d12G0(u,v)vi2qi2+2i=2d2G0(u,v)uivi1piqi1.h^{\prime\prime}(0)=\sum_{i=1}^{d}\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}^{2}}p_{i}^{2}+\sum_{i=1}^{d-1}\frac{\partial^{2}G_{0}(u,v)}{\partial v_{i}^{2}}q_{i}^{2}+2\sum_{i=2}^{d}\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}\partial v_{i-1}}p_{i}q_{i-1}.

By simple derivations, we can obtain 0<Ψ(x)<e,0Ψ(x)54/e,|Ψ′′(x)|850<\Psi(x)<e,0\leq\Psi^{\prime}(x)\leq\sqrt{54/e},|\Psi^{\prime\prime}(x)|\leq 8^{5} and 0<Φ(x)<2πe,0<Φ(x)e,supx|Φ′′(x)|270<\Phi(x)<\sqrt{2\pi e},0<\Phi^{\prime}(x)\leq\sqrt{e},\sup_{x}|\Phi^{\prime\prime}(x)|\leq 27. It follows

|2G0(u,v)ui2|\displaystyle\left|\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}^{2}}\right| =|Ψ(vi1)Φ′′(ui)Ψ(vi1)Φ′′(ui)|54e,1id,\displaystyle=\left|\Psi(-v_{i-1})\Phi^{\prime\prime}(-u_{i})-\Psi(v_{i-1})\Phi^{\prime\prime}(u_{i})\right|\leq 54e,\quad 1\leq i\leq d,
|2G0(u,v)vi2|\displaystyle\left|\frac{\partial^{2}G_{0}(u,v)}{\partial v_{i}^{2}}\right| =|Ψ′′(vi)Φ(ui+1)Ψ′′(vi)Φ(ui+1)|22πe×85,1id1,\displaystyle=\left|\Psi^{\prime\prime}(-v_{i})\Phi(-u_{i+1})-\Psi^{\prime\prime}(v_{i})\Phi(u_{i+1})\right|\leq 2\sqrt{2\pi e}\times 8^{5},\quad 1\leq i\leq d-1,
|2G0(u,v)uivi1|\displaystyle\left|\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}\partial v_{i-1}}\right| =|Ψ(vi1)Φ(ui)Ψ(vi1)Φ(ui)|254,2id.\displaystyle=\left|\Psi^{\prime}(-v_{i-1})\Phi^{\prime}(-u_{i})-\Psi^{\prime}(v_{i-1})\Phi^{\prime}(u_{i})\right|\leq 2\sqrt{54},\quad 2\leq i\leq d.

Therefore, it holds

|h′′(0)|\displaystyle|h^{\prime\prime}(0)| max{|2G0(u,v)ui2|,|2G0(u,v)vi2|,2|2G0(u,v)uivi1|}i=1d(pi2+qi2+piqi1)\displaystyle\leq\max\left\{\left|\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}^{2}}\right|,\left|\frac{\partial^{2}G_{0}(u,v)}{\partial v_{i}^{2}}\right|,2\left|\frac{\partial^{2}G_{0}(u,v)}{\partial u_{i}\partial v_{i-1}}\right|\right\}\sum_{i=1}^{d}\left(p_{i}^{2}+q_{i}^{2}+p_{i}q_{i-1}\right)
22πe×85i=1d(pi2+qi2+piqi1).\displaystyle\leq 2\sqrt{2\pi e}\times 8^{5}\cdot\sum_{i=1}^{d}\left(p_{i}^{2}+q_{i}^{2}+p_{i}q_{i-1}\right).

Since i=1d(pi2+qi2)=1\sum_{i=1}^{d}(p_{i}^{2}+q_{i}^{2})=1 and i=1dpiqi1pq1\sum_{i=1}^{d}p_{i}q_{i-1}\leq\|p\|\|q\|\leq 1, we obtain

|h′′(0)|600000,|h^{\prime\prime}(0)|\leq 600000,

which completes the proof. ∎
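The scalar bounds on Ψ, Φ and their derivatives used in this proof can be checked numerically on a grid. The sketch below forms the derivatives analytically (for x>1/2, Ψ′(x)=4Ψ(x)/(2x−1)³ and Ψ″(x)=Ψ(x)(16/(2x−1)⁶−24/(2x−1)⁴); Φ′(x)=√e·e^{−x²/2}) and confirms the constants stated above; the grid range and tolerances are arbitrary choices of this sketch.

```python
import numpy as np
from math import erf, sqrt, pi, e

xs = np.linspace(-10.0, 10.0, 200001)
mask = xs > 0.5
s = np.where(mask, 2 * xs - 1, 1.0)              # safe shifted variable for Psi

Psi = np.where(mask, np.exp(1 - 1 / s**2), 0.0)
dPsi = np.where(mask, 4 * Psi / s**3, 0.0)                     # Psi'
d2Psi = np.where(mask, Psi * (16 / s**6 - 24 / s**4), 0.0)     # Psi''

Phi = np.array([sqrt(e) * sqrt(pi / 2) * (1 + erf(x / sqrt(2))) for x in xs])
dPhi = sqrt(e) * np.exp(-xs**2 / 2)                            # Phi'
d2Phi = -sqrt(e) * xs * np.exp(-xs**2 / 2)                     # Phi''

print(Psi.max() <= e, dPsi.max() <= sqrt(54 / e) + 1e-9, np.abs(d2Psi).max() <= 8**5)
print(Phi.max() <= sqrt(2 * pi * e) + 1e-9, dPhi.max() <= sqrt(e), np.abs(d2Phi).max() <= 27)
```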

For given positive parameters c0,Lf,α>0c_{0},L_{f},\alpha>0, we construct the following scaled function based on G0(u,v)G_{0}(u,v),

G(u,v)=Lfα2l0G0(uα,vα),G(u,v)=\frac{L_{f}\alpha^{2}}{l_{0}}G_{0}\left(\frac{u}{\alpha},\frac{v}{\alpha}\right), (28)

where α\alpha can be adjusted to fulfill the condition f0(0)infxf0(x)Δf_{0}(0)-\inf_{x}f_{0}(x)\leq\Delta. Similarly, we define g(v):=G(v,v)g(v)\mathrel{\mathop{:}}=G(v,v). By simple derivations, we can generalize Lemma 7 to the case of G(u,v)G(u,v).

Corollary 7

The function G(u,v)G(u,v) has the following properties.

  1.

    G(0,0)infu,vG(u,v)12dl01Lfα2G(0,0)-\inf_{u,v}G(u,v)\leq 12dl_{0}^{-1}L_{f}\alpha^{2}.

  2.

    If vd=0v_{d}=0, then g(v)l01Lfα\|\nabla g(v)\|\geq l_{0}^{-1}L_{f}\alpha.

  3.

    G(u,v)G(u,v) is LfL_{f}-smooth.

Next, we give a key lemma that relates SubOpt𝖭𝖢\operatorname{SubOpt}_{\mathsf{NC}} and g\|\nabla g\|.

Lemma 8

For any x2Ndx\in\mathbb{R}^{2Nd}, let x¯=12Ni=12Nx[i]\bar{x}=\frac{1}{2N}\sum_{i=1}^{2N}x[i]. Suppose that G(,)G(\cdot,\cdot) is LfL_{f}-smooth, then we have

\operatorname{SubOpt}_{\mathsf{NC}}(x)\geq\frac{\sqrt{N}}{4}\|\nabla g(\bar{x})\|.
Proof

Denote 𝒢¯(x):=12Ni=12Nx[i]f0(x)\bar{\mathcal{G}}(x)\mathrel{\mathop{:}}=\frac{1}{2N}\sum_{i=1}^{2N}\nabla_{x[i]}f_{0}(x). On the one hand, it holds that

𝒫𝒳(x12Lff0(x))=(x¯12Lf𝒢¯(x),x¯12Lf𝒢¯(x),,x¯12Lf𝒢¯(x)).\mathcal{P}_{\mathcal{X}}\left(x-\frac{1}{2L_{f}}\nabla f_{0}(x)\right)=\left(\bar{x}-\frac{1}{2L_{f}}\bar{\mathcal{G}}(x),\bar{x}-\frac{1}{2L_{f}}\bar{\mathcal{G}}(x),\cdots,\bar{x}-\frac{1}{2L_{f}}\bar{\mathcal{G}}(x)\right).

Therefore, we have

x𝒫𝒳(x12Lff0(x))2=i=12Nx[i]x¯+12Lf𝒢¯(x)2=i=12Nx[i]x¯2+N2Lf2𝒢¯(x)2,\begin{split}\left\|x-\mathcal{P}_{\mathcal{X}}\left(x-\frac{1}{2L_{f}}\nabla f_{0}(x)\right)\right\|^{2}&=\sum_{i=1}^{2N}\left\|x[i]-\bar{x}+\frac{1}{2L_{f}}\bar{\mathcal{G}}(x)\right\|^{2}\\ &=\sum_{i=1}^{2N}\|x[i]-\bar{x}\|^{2}+\frac{N}{2L_{f}^{2}}\|\bar{\mathcal{G}}(x)\|^{2},\end{split} (29)

where the last equality holds because i=12N(x[i]x¯)=0\sum_{i=1}^{2N}(x[i]-\bar{x})=0. On the other hand, recall that g(x)=G(x,x)g(x)=G(x,x), by chain rule we have

Ng(x¯)=NuG(x¯,x¯)+vG(x¯,x¯)i=1NuG(x¯,x¯)uG(x[i],x[i+N])+i=1NvG(x¯,x¯)vG(x[i],x[i+N])+i=1N(uG(x[i],x[i+N])+vG(x[i],x[i+N])).\begin{split}N\|\nabla g(\bar{x})\|=&N\left\|\nabla_{u}G(\bar{x},\bar{x})+\nabla_{v}G(\bar{x},\bar{x})\right\|\\ \leq&\sum_{i=1}^{N}\left\|\nabla_{u}G(\bar{x},\bar{x})-\nabla_{u}G(x[i],x[i+N])\right\|\\ &+\sum_{i=1}^{N}\left\|\nabla_{v}G(\bar{x},\bar{x})-\nabla_{v}G(x[i],x[i+N])\right\|\\ &+\left\|\sum_{i=1}^{N}\left(\nabla_{u}G(x[i],x[i+N])+\nabla_{v}G(x[i],x[i+N])\right)\right\|.\end{split} (30)

For the first term on the right hand side, we have

uG(x¯,x¯)uG(x[i],x[i+N])\displaystyle\left\|\nabla_{u}G(\bar{x},\bar{x})-\nabla_{u}G(x[i],x[i+N])\right\|
\displaystyle\leq uG(x¯,x¯)uG(x[i],x¯)+uG(x[i],x¯)uG(x[i],x[i+N])\displaystyle~{}\left\|\nabla_{u}G(\bar{x},\bar{x})-\nabla_{u}G(x[i],\bar{x})\right\|+\left\|\nabla_{u}G(x[i],\bar{x})-\nabla_{u}G(x[i],x[i+N])\right\|
\displaystyle\leq Lf(x¯x[i]+x¯x[i+N]),\displaystyle~{}L_{f}\left(\|\bar{x}-x[i]\|+\|\bar{x}-x[i+N]\|\right),

where the second inequality is due to the LfL_{f}-smoothness of G(,)G(\cdot,\cdot). Similar derivation also applies to the second term on the right hand side of (30). Therefore, it holds

Ng(x¯)\displaystyle N\|\nabla g(\bar{x})\| 2Lfi=12Nx¯x[i]+2N𝒢¯(x)\displaystyle\leq 2L_{f}\sum_{i=1}^{2N}\|\bar{x}-x[i]\|+2N\|\bar{\mathcal{G}}(x)\|
(i)2Lf2Ni=12Nx¯x[i]2+2N𝒢¯(x)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}2L_{f}\sqrt{2N\sum_{i=1}^{2N}\|\bar{x}-x[i]\|^{2}}+2N\|\bar{\mathcal{G}}(x)\|
(ii)8Lf2N+8Lf2Ni=12Nx[i]x¯2+N2Lf2𝒢¯(x)2\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}\sqrt{8L_{f}^{2}N+8L_{f}^{2}N}\sqrt{\sum_{i=1}^{2N}\|x[i]-\bar{x}\|^{2}+\frac{N}{2L_{f}^{2}}\|\bar{\mathcal{G}}(x)\|^{2}}
(iii)4NSubOpt𝖭𝖢(x),\displaystyle\stackrel{{\scriptstyle(iii)}}{{\leq}}4\sqrt{N}\cdot\operatorname{SubOpt}_{\mathsf{NC}}(x),

where (i)(i) and (ii)(ii) come from Cauchy–Schwarz inequality, (iii)(iii) utilizes equality (29). ∎

We can now state a complexity lower bound for finding an approximate solution of non-convex problems with first-order linear span algorithms.

Theorem 4.3

Let parameters Lf,Δ>0,κA1L_{f},\Delta>0,\kappa_{A}\geq 1 be given. For any integer kκAk\geq\kappa_{A}, there exists an instance in NC(Lf,κA,Δ)\mathcal{F}_{\mathrm{NC}}(L_{f},\kappa_{A},\Delta) of form (20), with component function GG defined in (28), N=κA+12,K=k2N+1+1,d=K+2N=\left\lfloor\frac{\kappa_{A}+1}{2}\right\rfloor,K=\left\lfloor\frac{k-2}{N+1}\right\rfloor+1,d=K+2 and α=l0Δ12NdLf\alpha=\sqrt{\frac{l_{0}\Delta}{12NdL_{f}}}. For this problem, suppose the initial point is x0=0x_{0}=0; then, for a universal constant c0>0c_{0}>0, the iterates generated by any first-order algorithm in 𝒜\mathcal{A} satisfy

SubOpt𝖭𝖢(xk)c0κALfΔk.\operatorname{SubOpt}_{\mathsf{NC}}(x_{k})\geq c_{0}\cdot\sqrt{\frac{\kappa_{A}L_{f}\Delta}{k}}.
Proof

By property (3) in Corollary 7, we know f0(x)f_{0}(x) is LfL_{f}-smooth. According to the definition of α\alpha, it holds F(0)infxF(x)f0(0)infxf0(x)12Ndl01Lfα2ΔF(0)-\inf_{x}F(x)\leq f_{0}(0)-\inf_{x}f_{0}(x)\leq 12Ndl_{0}^{-1}L_{f}\alpha^{2}\leq\Delta. As proved in Theorem 4.2, the condition number of AA is not greater than κA\kappa_{A}. Accordingly, the constructed instance belongs to the class NC(Lf,κA,Δ)\mathcal{F}_{\mathrm{NC}}(L_{f},\kappa_{A},\Delta). Since K=d2K=d-2, the last element of all 2N2N vectors is zero, that is, (x[i])d=0,1i2N(x[i])_{d}=0,\forall 1\leq i\leq 2N. This implies x¯d=0\bar{x}_{d}=0. Utilizing Lemma 8 and property (2) in Corollary 7, we obtain

SubOpt𝖭𝖢(x)N4g(x¯)N4l01Lfα=14LfΔ12l0d.\operatorname{SubOpt}_{\mathsf{NC}}(x)\geq\frac{\sqrt{N}}{4}\|\nabla g(\bar{x})\|\geq\frac{\sqrt{N}}{4}l_{0}^{-1}L_{f}\alpha=\frac{1}{4}\sqrt{\frac{L_{f}\Delta}{12l_{0}d}}.

By definition, we have d=K+25kκAd=K+2\leq\frac{5k}{\kappa_{A}}, and this gives the desired result. ∎

Corollary 8

For any first-order algorithm in 𝒜\mathcal{A} and 0<ϵc0LfΔ0<\epsilon\leq c_{0}\sqrt{L_{f}\Delta} where c0c_{0} is defined in Theorem 4.3, there exists a problem instance in NC(Lf,κA,Δ)\mathcal{F}_{\mathrm{NC}}(L_{f},\kappa_{A},\Delta) such that at least

Ω(κALfΔϵ2)\Omega\left(\frac{\kappa_{A}L_{f}\Delta}{\epsilon^{2}}\right)

iterations are required in order to find an iterate xkx_{k} satisfying SubOpt𝖭𝖢(xk)ϵ\operatorname{SubOpt}_{\mathsf{NC}}(x_{k})\leq\epsilon.

4.5 Convex case

For the linear equality constrained problem, the definition of suboptimality becomes

SubOpt𝖢(x)=f(x)minx:Ax=bf(x)+ρAxb.\displaystyle\operatorname{SubOpt}_{\mathsf{C}}(x)=f(x)-\min_{x:Ax=b}f(x)+\rho\|Ax-b\|.

Recall that for ρ2y\rho\geq 2\left\|y^{\star}\right\|, a bound on SubOpt𝖢(x)\operatorname{SubOpt}_{\mathsf{C}}(x) implies a corresponding bound on the standard optimality measure max{|f(x)f(x)|,Axb}\max\{\left|f(x)-f(x^{\star})\right|,\left\|Ax-b\right\|\}.

Note that in Section 4.3, we consider the strongly convex problem class (Lf,μf,κA)\mathcal{F}(L_{f},\mu_{f},\kappa_{A}) with μf>0\mu_{f}>0. In this section, we demonstrate how the complexity lower bound presented in Theorem 4.2 can be adapted to the convex problem class with μf=0\mu_{f}=0.

Theorem 4.4

Let parameters Lf,D>0,κA1L_{f},D>0,\kappa_{A}\geq 1 be given. For any integer kκAk\geq\kappa_{A}, there exists an instance in C(Lf,κA,D)\mathcal{F}_{\mathrm{C}}(L_{f},\kappa_{A},D) of form (20), with component function GG defined in (23), N=κA+12,K=k2N+1+1,d=2K,μf=Lf(K+1)2N=\left\lfloor\frac{\kappa_{A}+1}{2}\right\rfloor,K=\left\lfloor\frac{k-2}{N+1}\right\rfloor+1,d=2K,\mu_{f}=\frac{L_{f}}{(K+1)^{2}} and α\alpha be suitably chosen. For this problem, the kk-th iterate xkx_{k} generated by any first-order algorithm in 𝒜\mathcal{A} satisfies

f(xk)f(x)λ,Axkbc0κA2LfD2k2,f(x_{k})-f(x^{\star})-\left\langle{\lambda^{\star}},{Ax_{k}-b}\right\rangle\geq c_{0}\cdot\frac{\kappa_{A}^{2}L_{f}D^{2}}{k^{2}},

where λ\lambda^{\star} is the optimal dual variable and c0c_{0} is a universal constant.

Proof

We can suitably choose α\alpha so that x0x=D\|x_{0}-x^{\star}\|=D. Then, under our choice of parameters, we have κf=Lf/μf=(K+1)2\kappa_{f}=L_{f}/\mu_{f}=(K+1)^{2}. Note that qexp(2κf1)q\geq\exp\left(-\frac{2}{\sqrt{\kappa_{f}}-1}\right), and hence q2Ke4q^{2K}\geq e^{-4}. Therefore, using the fact that ff is μf\mu_{f}-strongly convex and ATλ=f(x)A^{\mathrm{T}}\lambda^{\star}=\nabla f(x^{\star}), we have

f(xk)f(x)λ,Axkb=\displaystyle f(x_{k})-f(x^{\star})-\left\langle{\lambda^{\star}},{Ax_{k}-b}\right\rangle= f(xk)f(x)f(x),xkx\displaystyle~{}f(x_{k})-f(x^{\star})-\left\langle{\nabla f(x^{\star})},{x_{k}-x^{\star}}\right\rangle
\displaystyle\geq μf2xkx2μfq2K8x0x2\displaystyle~{}\frac{\mu_{f}}{2}\left\|x_{k}-x^{\star}\right\|^{2}\geq\frac{\mu_{f}q^{2K}}{8}\left\|x_{0}-x^{\star}\right\|^{2}
\displaystyle\geq e48LfD2(K+1)2,\displaystyle~{}\frac{e^{-4}}{8}\cdot\frac{L_{f}D^{2}}{(K+1)^{2}},

where the last two inequalities are due to (26). According to the choice of KK and NN, we have K+14kκAK+1\leq\frac{4k}{\kappa_{A}}, which completes the proof. ∎

Corollary 9

For any first-order algorithm in 𝒜\mathcal{A} and any 0<ϵc0LfD20<\epsilon\leq c_{0}L_{f}D^{2} where c0c_{0} is defined in Theorem 4.4, there exists a problem instance in C(Lf,κA,D)\mathcal{F}_{\mathrm{C}}(L_{f},\kappa_{A},D) such that at least

Ω(κALfDϵ)\Omega\left(\frac{\kappa_{A}\sqrt{L_{f}}D}{\sqrt{\epsilon}}\right)

iterations are required in order to find an iterate xkx_{k} satisfying SubOpt𝖢(xk)ϵ\operatorname{SubOpt}_{\mathsf{C}}(x_{k})\leq\epsilon with ρλ\rho\geq\left\|\lambda^{\star}\right\|.

5 Linear equality constrained problem

In this section, we focus on a special case of (1) when h()=𝕀{0}()h(\cdot)=\mathbb{I}_{\{0\}}(\cdot), corresponding to the linear equality constrained problem

minxf(x),s.t.Ax=b.\min_{x}\ f(x),\quad\,\textrm{s.t.}\,\ Ax=b. (31)

In Section 5.1, it will be demonstrated that the requirement of full row rank for the matrix AA can be removed. In Section 5.2, we will show that the optimal iteration complexity can be achieved by directly applying APPA on the convex problem.

5.1 Refined result of linear equality constrained problem

First, we consider the strongly convex case. Similar to the derivation in Section 3.1, problem (31) is equivalent to

minxmaxλ(x,λ):=f(x)+λT(Axb).\min_{x}\max_{\lambda}\ \mathcal{L}(x,\lambda)\mathrel{\mathop{:}}=f(x)+\lambda^{\mathrm{T}}(Ax-b).

Due to the strong duality, we have minxmaxλ(x,λ)=maxλminx(x,λ)\min_{x}\max_{\lambda}\ \mathcal{L}(x,\lambda)=\max_{\lambda}\min_{x}\ \mathcal{L}(x,\lambda). Accordingly, the corresponding dual problem can be written as

maxλΦ(λ):=bTλf(ATλ).\max_{\lambda}\ \Phi(\lambda)\mathrel{\mathop{:}}=-b^{\mathrm{T}}\lambda-f^{\star}(-A^{\mathrm{T}}\lambda). (32)

Without the assumption that AA has full row rank, the proof of Lemma 3 no longer holds and Φ(λ)\Phi(\lambda) is not necessarily strongly concave. Instead, we have the following property.

Lemma 9

Let σ¯min\underline{\sigma}_{\min} be the minimal nonzero singular value of AA. It holds that Φ(λ)\Phi(\lambda) is (σ¯min2/Lf)(\underline{\sigma}_{\min}^{2}/L_{f})-strongly concave on the subspace (A)\mathcal{R}(A).

Proof

Noting that f(x)f(x) is LfL_{f}-smooth and μf\mu_{f}-strongly convex, it is easy to derive that f(x)f^{\star}(x) is 1μf\frac{1}{\mu_{f}}-smooth and 1Lf\frac{1}{L_{f}}-strongly convex. Then for any λ,λ(A)\lambda,\lambda^{\prime}\in\mathcal{R}(A), we have

Φ(λ)Φ(λ)\displaystyle\Phi(\lambda^{\prime})-\Phi(\lambda) =b,λλf(ATλ)+f(ATλ)\displaystyle=-\langle b,\lambda^{\prime}-\lambda\rangle-f^{\star}(-A^{\mathrm{T}}\lambda^{\prime})+f^{\star}(-A^{\mathrm{T}}\lambda)
b,λλ+f(ATλ),ATλATλATλATλ22Lf\displaystyle\leq-\langle b,\lambda^{\prime}-\lambda\rangle+\langle\nabla f^{\star}(-A^{\mathrm{T}}\lambda),A^{\mathrm{T}}\lambda^{\prime}-A^{\mathrm{T}}\lambda\rangle-\frac{\|A^{\mathrm{T}}\lambda-A^{\mathrm{T}}\lambda^{\prime}\|^{2}}{2L_{f}}
=Φ(λ),λλ12LfATλATλ2\displaystyle=\langle\nabla\Phi(\lambda),\lambda^{\prime}-\lambda\rangle-\frac{1}{2L_{f}}\|A^{\mathrm{T}}\lambda-A^{\mathrm{T}}\lambda^{\prime}\|^{2}
Φ(λ),λλσ¯min22Lfλλ2,\displaystyle\leq\langle\nabla\Phi(\lambda),\lambda^{\prime}-\lambda\rangle-\frac{\underline{\sigma}_{\min}^{2}}{2L_{f}}\|\lambda-\lambda^{\prime}\|^{2},

where the last inequality holds because λ,λ(A)\lambda,\lambda^{\prime}\in\mathcal{R}(A). ∎

According to the update rule of Algorithm 2, if we set λ0=0\lambda_{0}=0, then we have

λk+1Span{Ax0b,Ax1b,,Axkb},k0.\lambda_{k+1}\in\mathrm{Span}\left\{Ax_{0}-b,Ax_{1}-b,\cdots,Ax_{k}-b\right\},k\geq 0.

Due to the feasibility of the constraint Ax=bAx=b, we have b(A)b\in\mathcal{R}(A). Hence it holds that the iterate sequence {λk}\{\lambda_{k}\} always stays in (A)\mathcal{R}(A). Combining Lemma 9, we can treat Φ(λ)\Phi(\lambda) as a strongly concave function during the iterations. Then the derivation in Section 3.1 still holds by replacing μA\mu_{A} with σ¯min\underline{\sigma}_{\min} and replacing κA\kappa_{A} with κ¯A:=LAσ¯min\underline{\kappa}_{A}\mathrel{\mathop{:}}=\frac{L_{A}}{\underline{\sigma}_{\min}}. Consequently, the derived complexity upper bound is O~(κ¯Aκflog(1/ϵ))\tilde{O}(\underline{\kappa}_{A}\sqrt{\kappa_{f}}\log(1/\epsilon)).
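The quantities in this argument are easy to compute explicitly. The following minimal sketch, assuming a randomly generated rank-deficient AA and an illustrative smoothness constant, computes σ̲_min via the SVD, the restricted strong concavity modulus σ̲_min²/Lf of Lemma 9, and checks the inequality ‖Aᵀw‖ ≥ σ̲_min‖w‖ for a vector w in the range of AA.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 8, 12, 5                               # rank r < m: A is not full row rank
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
L_f = 4.0                                        # illustrative smoothness constant

sv = np.linalg.svd(A, compute_uv=False)
sigma_min_nz = sv[sv > 1e-10 * sv[0]].min()      # minimal nonzero singular value
print("restricted strong concavity modulus:", sigma_min_nz**2 / L_f)

w = A @ rng.standard_normal(n)                   # an arbitrary vector in R(A)
print(np.linalg.norm(A.T @ w) >= sigma_min_nz * np.linalg.norm(w) - 1e-10)
```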

For the convex problem, Algorithm 2 is utilized to solve the strongly convex auxiliary problem (19), hence the upper bound can also be generalized to O~(κ¯ALfDϵ)\tilde{O}\left(\frac{\underline{\kappa}_{A}\sqrt{L_{f}}D}{\sqrt{\epsilon}}\right). For the nonconvex problem, Algorithm 2 is utilized to solve the subproblem (16). Therefore, the complexity upper bound can be similarly generalized to O~(κ¯ALfΔϵ2)\tilde{O}\left(\frac{\underline{\kappa}_{A}L_{f}\Delta}{\epsilon^{2}}\right).

5.2 Direct acceleration for convex problem

Recall that in Section 3.4, we derived the upper bound for the convex problem by constructing a strongly convex auxiliary problem. In this section, we propose an optimal algorithm, presented in Algorithm 4, that solves the original convex problem directly. The algorithm performs an inexact accelerated proximal point method on f(x)f(x) while keeping the constraint intact in the outer loop. In the inner loop, it uses Algorithm 2 to iteratively solve the subproblem until it meets the first-order suboptimality conditions.

1 Input: initial point x0x_{0}, smoothness parameter LfL_{f}, condition number κA\kappa_{A}, subproblem error tolerances {ϵk}\{\epsilon_{k}\} and {γk}\{\gamma_{k}\}, the maximum iteration number TT.
2 Initialize: y1x0y_{1}\leftarrow x_{0}, t11t_{1}\leftarrow 1.
3 for k=1Tk=1\cdots T do
4       Apply Algorithm 2 to find a pair of suboptimal primal and dual variable (xk,λk)(x_{k},\lambda_{k}) of problem
minxf(x)+Lf2xyk2,s.t.Ax=b,\min_{x}f(x)+\frac{L_{f}}{2}\|x-y_{k}\|^{2},\quad\,\textrm{s.t.}\,Ax=b, (33)
such that f(xk)+Lf(xkyk)+ATλkLf2ϵktk\|\nabla f(x_{k})+L_{f}(x_{k}-y_{k})+A^{\mathrm{T}}\lambda_{k}\|\leq\sqrt{\frac{L_{f}}{2}}\cdot\frac{\epsilon_{k}}{t_{k}} and Axkbγk\|Ax_{k}-b\|\leq\gamma_{k}.
5       Compute tk+1=1+1+4tk22t_{k+1}=\frac{1+\sqrt{1+4t_{k}^{2}}}{2}.
6       Compute yk+1=xk+(tk1tk+1)(xkxk1)y_{k+1}=x_{k}+\left(\frac{t_{k}-1}{t_{k+1}}\right)\left(x_{k}-x_{k-1}\right).
Output: xTx_{T}.
Algorithm 4 Inexact APPA for convex problems
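The outer loop of Algorithm 4 can be sketched as follows. Here `solve_subproblem` is a stand-in for Algorithm 2 applied to (33); for illustration only it performs an exact KKT solve for a convex quadratic f, which trivially meets the inner tolerances (δ_k=0 and γ_k at the level of the linear solver), so the sequences {ϵ_k}, {γ_k} are omitted. All data below are synthetic assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 30, 10
Q = rng.standard_normal((n, n)); Q = Q.T @ Q / n            # convex quadratic f(x)=0.5 x'Qx - c'x
c = rng.standard_normal(n)
A = rng.standard_normal((m, n)); b = A @ rng.standard_normal(n)
L_f = np.linalg.eigvalsh(Q).max()

def solve_subproblem(y):
    """Exact stand-in for Algorithm 2 on (33): KKT system of the proximal subproblem."""
    K = np.block([[Q + L_f * np.eye(n), A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([c + L_f * y, b])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]                                 # (x_k, lambda_k)

def inexact_appa(x0, T=50):
    x_prev, y, t = x0.copy(), x0.copy(), 1.0                # y_1 = x_0, t_1 = 1
    for _ in range(T):
        x, lam = solve_subproblem(y)
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2            # t_{k+1} update
        y = x + ((t - 1) / t_next) * (x - x_prev)           # extrapolation step
        x_prev, t = x, t_next
    return x, lam

x, lam = inexact_appa(np.zeros(n))
print("feasibility:", np.linalg.norm(A @ x - b))
print("stationarity:", np.linalg.norm(Q @ x - c + A.T @ lam))
```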

The convergence proof is inspired by beck2009fast and jiang2012inexact . Different from several works that study the inexact accelerated PPA for the unconstrained convex problem, we have an additional linear equality constraint here. Since projection onto the linear constraint is not allowed, we need to incorporate the violation of the constraint into the analysis, which makes the convergence proof quite different. For simplicity, we define the following notation:

f(xk)+Lf(xkyk)+ATλk=δk,Axkb=ζk,\nabla f(x_{k})+L_{f}(x_{k}-y_{k})+A^{\mathrm{T}}\lambda_{k}=\delta_{k},~{}~{}Ax_{k}-b=\zeta_{k},
vk=f(xk)f(x)+λ,Axkb0,uk=tkxk(tk1)xk1x,v_{k}=f(x_{k})-f(x^{\star})+\langle\lambda^{\star},Ax_{k}-b\rangle\geq 0,~{}~{}u_{k}=t_{k}x_{k}-\left(t_{k}-1\right)x_{k-1}-x^{\star},
ak=tk2vk0,bk=Lf2uk2,τ=Lf2x0x2,ek=tkδk,uk,a_{k}=t_{k}^{2}v_{k}\geq 0,~{}~{}b_{k}=\frac{L_{f}}{2}\|u_{k}\|^{2},~{}~{}\tau=\frac{L_{f}}{2}\|x_{0}-x^{\star}\|^{2},~{}~{}e_{k}=t_{k}\left\langle\delta_{k},u_{k}\right\rangle,
ωk=xkx,ξk=|tkλkλ,Auk|,ϵ¯k=j=1kϵj,ξ¯k=j=1k(ξj+ϵj2).\omega_{k}=\|x_{k}-x^{\star}\|,~{}~{}\xi_{k}=|t_{k}\langle\lambda_{k}-\lambda^{\star},Au_{k}\rangle|,~{}~{}\bar{\epsilon}_{k}=\sum_{j=1}^{k}\epsilon_{j},~{}~{}\bar{\xi}_{k}=\sum_{j=1}^{k}\left(\xi_{j}+\epsilon_{j}^{2}\right).

Due to the KKT conditions, we know Ax=bAx^{\star}=b and f(x)+ATλ=0\nabla f(x^{\star})+A^{\mathrm{T}}\lambda^{\star}=0. It follows that vk=f(xk)f(x)f(x),xkxv_{k}=f(x_{k})-f(x^{\star})-\langle\nabla f(x^{\star}),x_{k}-x^{\star}\rangle, which corresponds to the Bregman divergence associated with ff between the points xkx_{k} and xx^{\star}. In other words, vkv_{k} measures the distance between xkx_{k} and xx^{\star} under the Bregman divergence and can serve as a suboptimality measure. In the following analysis, we first give an upper bound of vkv_{k} (or aka_{k}) and eventually derive an upper bound for the objective function gap and the constraint violation.

According to the update rule tk+1=1+1+4tk22t_{k+1}=\frac{1+\sqrt{1+4t_{k}^{2}}}{2} and t1=1t_{1}=1, it is easy to obtain the following lemma, which will be frequently used in the following proofs.

Lemma 10

The sequence {tk}\{t_{k}\} generated by Algorithm 4 satisfies k+12tkk\frac{k+1}{2}\leq t_{k}\leq k for any k1k\geq 1.

In the following lemma, we give the one-step estimation of ak+bka_{k}+b_{k}.

Lemma 11

For any k1k\geq 1, it holds

ak+1+bk+1ak+bk+ξk+1+ek+1.a_{k+1}+b_{k+1}\leq a_{k}+b_{k}+\xi_{k+1}+e_{k+1}. (34)
Proof

For each jj and any xx, we have

f(x)f(xj)xxj,f(xj)=xxj,δjLf(xjyj)ATλjLf2xjyj2+Lfxjyj,yjx+xxj,δjATλj.\begin{split}f(x)-f(x_{j})&\geq\langle x-x_{j},\nabla f(x_{j})\rangle\\ &=\langle x-x_{j},\delta_{j}-L_{f}(x_{j}-y_{j})-A^{\mathrm{T}}\lambda_{j}\rangle\\ &\geq\frac{L_{f}}{2}\|x_{j}-y_{j}\|^{2}+L_{f}\langle x_{j}-y_{j},y_{j}-x\rangle\\ &\quad+\langle x-x_{j},\delta_{j}-A^{\mathrm{T}}\lambda_{j}\rangle.\end{split} (35)

Note that vkvk+1=f(xk)f(xk+1)+λ,AxkAxk+1v_{k}-v_{k+1}=f(x_{k})-f(x_{k+1})+\langle\lambda^{\star},Ax_{k}-Ax_{k+1}\rangle. We apply (35) with j=k+1j=k+1 and x=xkx=x_{k} to get

vkvk+1Lf2xk+1yk+12+Lfxk+1yk+1,yk+1xk+ζkζk+1,λλk+1+xkxk+1,δk+1,\begin{split}v_{k}-v_{k+1}&\geq\frac{L_{f}}{2}\|x_{k+1}-y_{k+1}\|^{2}+L_{f}\langle x_{k+1}-y_{k+1},y_{k+1}-x_{k}\rangle\\ &\quad+\langle\zeta_{k}-\zeta_{k+1},\lambda^{\star}-\lambda_{k+1}\rangle+\langle x_{k}-x_{k+1},\delta_{k+1}\rangle,\end{split} (36)

where we utilize the fact that AxkAxk+1=ζkζk+1Ax_{k}-Ax_{k+1}=\zeta_{k}-\zeta_{k+1}. Similarly, let j=k+1j=k+1 and x=xx=x^{\star} in (35), we obtain

vk+1=f(x)f(xk+1)λ,Axk+1bLf2xk+1yk+12+Lfxk+1yk+1,yk+1xAxAxk+1,λk+1λ,Axk+1b+xxk+1,δk+1=Lf2xk+1yk+12+Lfxk+1yk+1,yk+1xζk+1,λλk+1+xxk+1,δk+1,\begin{split}-v_{k+1}&=f(x^{\star})-f(x_{k+1})-\langle\lambda^{\star},Ax_{k+1}-b\rangle\\ &\geq\frac{L_{f}}{2}\|x_{k+1}-y_{k+1}\|^{2}+L_{f}\langle x_{k+1}-y_{k+1},y_{k+1}-x^{\star}\rangle\\ &\quad-\langle Ax^{\star}-Ax_{k+1},\lambda_{k+1}\rangle-\langle\lambda^{\star},Ax_{k+1}-b\rangle+\langle x^{\star}-x_{k+1},\delta_{k+1}\rangle\\ &=\frac{L_{f}}{2}\|x_{k+1}-y_{k+1}\|^{2}+L_{f}\langle x_{k+1}-y_{k+1},y_{k+1}-x^{\star}\rangle\\ &\quad-\langle\zeta_{k+1},\lambda^{\star}-\lambda_{k+1}\rangle+\langle x^{\star}-x_{k+1},\delta_{k+1}\rangle,\end{split} (37)

Multiply (36) by (tk+11)(t_{k+1}-1) and add it to (37). Then, multiplying both sides of the resulting inequality by tk+1t_{k+1} and using the relation tk2=tk+12tk+1t_{k}^{2}=t_{k+1}^{2}-t_{k+1}, we have

tk2vktk+12vk+1\displaystyle t_{k}^{2}v_{k}-t_{k+1}^{2}v_{k+1} Lf2tk+1(xk+1yk+1)2\displaystyle\geq\frac{L_{f}}{2}\|t_{k+1}(x_{k+1}-y_{k+1})\|^{2}
+Lftk+1xk+1yk+1,tk+1yk+1(tk+11)xkx\displaystyle\quad+L_{f}t_{k+1}\langle x_{k+1}-y_{k+1},t_{k+1}y_{k+1}-(t_{k+1}-1)x_{k}-x^{\star}\rangle
+λλk+1,tk2ζktk+12ζk+1tk+1δk+1,uk+1.\displaystyle\quad+\langle\lambda^{\star}-\lambda_{k+1},t_{k}^{2}\zeta_{k}-t_{k+1}^{2}\zeta_{k+1}\rangle-t_{k+1}\langle\delta_{k+1},u_{k+1}\rangle.

For the first two terms on the right hand side of the above inequality, we use the usual Pythagoras relation ba2+2ba,ac=bc2ac2\|{b}-{a}\|^{2}+2\langle{b}-{a},{a}-{c}\rangle=\|{b}-{c}\|^{2}-\|{a}-{c}\|^{2}, then substitute tk+1yk+1=tk+1xk+(tk1)(xkxk1)t_{k+1}y_{k+1}=t_{k+1}x_{k}+\left(t_{k}-1\right)\left(x_{k}-x_{k-1}\right) into it. After rearranging, we obtain

tk+12vk+1+Lf2uk+12tk2vk+Lf2uk2λλk+1,tk2ζktk+12ζk+1+ek+1.t_{k+1}^{2}v_{k+1}+\frac{L_{f}}{2}\|u_{k+1}\|^{2}\leq t_{k}^{2}v_{k}+\frac{L_{f}}{2}\|u_{k}\|^{2}-\langle\lambda^{\star}-\lambda_{k+1},t_{k}^{2}\zeta_{k}-t_{k+1}^{2}\zeta_{k+1}\rangle+e_{k+1}.

Combining the fact tk+1Auk+1=tk+12ζk+1tk2ζkt_{k+1}Au_{k+1}=t_{k+1}^{2}\zeta_{k+1}-t_{k}^{2}\zeta_{k} yields the conclusion. ∎

Now, having the inequality (34), we can derive an upper bound of aka_{k}.

Lemma 12

For any k1k\geq 1, it holds

ak(τ+ϵ¯k)2+2ξ¯k.a_{k}\leq\left(\sqrt{\tau}+\bar{\epsilon}_{k}\right)^{2}+2\bar{\xi}_{k}. (38)
Proof

By applying (35) with x=xx=x^{\star} and j=1j=1, we have

f(x)f(x1)\displaystyle f(x^{\star})-f(x_{1}) Lf2x1y12+Lfx1y1,y1x\displaystyle\geq\frac{L_{f}}{2}\|x_{1}-y_{1}\|^{2}+L_{f}\langle x_{1}-y_{1},y_{1}-x^{\star}\rangle
AxAx1,λ1+xx1,δ1\displaystyle\quad-\langle Ax^{\star}-Ax_{1},\lambda_{1}\rangle+\langle x^{\star}-x_{1},\delta_{1}\rangle
=Lf2x1x2Lf2y1x2\displaystyle=\frac{L_{f}}{2}\|x_{1}-x^{\star}\|^{2}-\frac{L_{f}}{2}\|y_{1}-x^{\star}\|^{2}
AxAx1,λ1+xx1,δ1,\displaystyle\quad-\langle Ax^{\star}-Ax_{1},\lambda_{1}\rangle+\langle x^{\star}-x_{1},\delta_{1}\rangle,

where we utilize the usual Pythagoras relation ba2+2ba,ac=bc2ac2\|{b}-{a}\|^{2}+2\langle{b}-{a},{a}-{c}\rangle=\|{b}-{c}\|^{2}-\|{a}-{c}\|^{2}. Noting that y1=x0y_{1}=x_{0}, t1=1t_{1}=1 and u1=x1xu_{1}=x_{1}-x^{\star}, we have

t12v1+Lf2u12Lf2x0x2+t1λλ1,Au1+e1.t_{1}^{2}v_{1}+\frac{L_{f}}{2}\|u_{1}\|^{2}\leq\frac{L_{f}}{2}\|x_{0}-x^{\star}\|^{2}+t_{1}\langle\lambda^{\star}-\lambda_{1},Au_{1}\rangle+e_{1}.

Since ek(tk2/Lfδk)(Lf/2uk)ϵkbke_{k}\leq(t_{k}\sqrt{2/L_{f}}\|\delta_{k}\|)(\sqrt{L_{f}/2}\|u_{k}\|)\leq\epsilon_{k}\sqrt{b_{k}}, the above inequality implies

a1+b1τ+ξ1+ϵ1b1.a_{1}+b_{1}\leq\tau+\xi_{1}+\epsilon_{1}\sqrt{b_{1}}. (39)

Let sk=i=1kξi+i=1kϵibis_{k}=\sum_{i=1}^{k}\xi_{i}+\sum_{i=1}^{k}\epsilon_{i}\sqrt{b_{i}}. Utilizing (34) repeatedly and combining with (39), we obtain

ak+bkτ+sk.a_{k}+b_{k}\leq\tau+s_{k}. (40)

Since ak0a_{k}\geq 0, we have sk=sk1+ϵkbk+ξksk1+ϵkτ+sk+ξks_{k}=s_{k-1}+\epsilon_{k}\sqrt{b_{k}}+\xi_{k}\leq s_{k-1}+\epsilon_{k}\sqrt{\tau+s_{k}}+\xi_{k}, which implies

τ+sk12(ϵk+ϵk2+4(τ+sk1+ξk)).\sqrt{\tau+s_{k}}\leq\frac{1}{2}\left(\epsilon_{k}+\sqrt{\epsilon_{k}^{2}+4\left(\tau+s_{k-1}+\xi_{k}\right)}\right).

By some simple derivations, we get

sksk1+12ϵk2+ξk+12ϵkϵk2+4(τ+sk1+ξk)sk1+12ϵk2+ξk+12ϵk(ϵk+2τ+2sk1+ξk)sk1+ϵk2+ξk+ϵk(τ+sk1+ξk).\begin{split}s_{k}&\leq s_{k-1}+\frac{1}{2}\epsilon_{k}^{2}+\xi_{k}+\frac{1}{2}\epsilon_{k}\sqrt{\epsilon_{k}^{2}+4\left(\tau+s_{k-1}+\xi_{k}\right)}\\ &\leq s_{k-1}+\frac{1}{2}\epsilon_{k}^{2}+\xi_{k}+\frac{1}{2}\epsilon_{k}\left(\epsilon_{k}+2\sqrt{\tau}+2\sqrt{s_{k-1}+\xi_{k}}\right)\\ &\leq s_{k-1}+\epsilon_{k}^{2}+\xi_{k}+\epsilon_{k}\left(\sqrt{\tau}+\sqrt{s_{k-1}+\xi_{k}}\right).\end{split} (41)

From the inequality (39), we know that τb1ϵ1b1ξ1\tau\geq b_{1}-\epsilon_{1}\sqrt{b_{1}}-\xi_{1}, which implies that b112(ϵ1+ϵ12+4(τ+ξ1))ϵ1+τ+ξ1\sqrt{b_{1}}\leq\frac{1}{2}\left(\epsilon_{1}+\sqrt{\epsilon_{1}^{2}+4\left(\tau+\xi_{1}\right)}\right)\leq\epsilon_{1}+\sqrt{\tau+\xi_{1}}. Accordingly,

s1=ϵ1b1+ξ1ϵ1(ϵ1+τ+ξ1)+ξ1ϵ12+ξ1+ϵ1(τ+ξ1).s_{1}=\epsilon_{1}\sqrt{b_{1}}+\xi_{1}\leq\epsilon_{1}\left(\epsilon_{1}+\sqrt{\tau+\xi_{1}}\right)+\xi_{1}\leq\epsilon_{1}^{2}+\xi_{1}+\epsilon_{1}\left(\sqrt{\tau}+\sqrt{\xi_{1}}\right). (42)

Applying (41) repeatedly and combining (42) yields

sk\displaystyle s_{k} s1+j=2kϵj2+j=2kξj+τj=2kϵj+j=2kϵjsj1+ξj\displaystyle\leq s_{1}+\sum_{j=2}^{k}\epsilon_{j}^{2}+\sum_{j=2}^{k}\xi_{j}+\sqrt{\tau}\sum_{j=2}^{k}\epsilon_{j}+\sum_{j=2}^{k}\epsilon_{j}\sqrt{s_{j-1}+\xi_{j}}
j=1kϵj2+j=1kξj+τj=1kϵj+j=1kϵjsj\displaystyle\leq\sum_{j=1}^{k}\epsilon_{j}^{2}+\sum_{j=1}^{k}\xi_{j}+\sqrt{\tau}\sum_{j=1}^{k}\epsilon_{j}+\sum_{j=1}^{k}\epsilon_{j}\sqrt{s_{j}}
ξ¯k+τϵ¯k+skϵ¯k,\displaystyle\leq\bar{\xi}_{k}+\sqrt{\tau}\bar{\epsilon}_{k}+\sqrt{s_{k}}\bar{\epsilon}_{k},

where the second inequality holds because sj1+ξjsjs_{j-1}+\xi_{j}\leq s_{j} and ξ1s1\xi_{1}\leq s_{1} by the definition of sks_{k}. The above inequality implies

sk12(ϵ¯k+(ϵ¯k2+4ξ¯k+4ϵ¯kτ)1/2),\sqrt{s_{k}}\leq\frac{1}{2}\left(\bar{\epsilon}_{k}+\left(\bar{\epsilon}_{k}^{2}+4\bar{\xi}_{k}+4\bar{\epsilon}_{k}\sqrt{\tau}\right)^{1/2}\right), (43)

and hence skϵ¯k2+2ξ¯k+2ϵ¯kτs_{k}\leq\bar{\epsilon}_{k}^{2}+2\bar{\xi}_{k}+2\bar{\epsilon}_{k}\sqrt{\tau}. Substituting it into (40), we obtain

akτ+ϵ¯k2+2ξ¯k+2ϵ¯kτ(τ+ϵ¯k)2+2ξ¯k,a_{k}\leq\tau+\bar{\epsilon}_{k}^{2}+2\bar{\xi}_{k}+2\bar{\epsilon}_{k}\sqrt{\tau}\leq\left(\sqrt{\tau}+\bar{\epsilon}_{k}\right)^{2}+2\bar{\xi}_{k},

which completes the proof. ∎

Note that the right hand side of (38) depends on ξk\xi_{k}, which has not yet been bounded. We estimate this term in the following lemma.

Lemma 13

Assume that γk1\gamma_{k}\leq 1, δk1\|\delta_{k}\|\leq 1 and tk+12γk+1tk2γkt_{k+1}^{2}\gamma_{k+1}\leq t_{k}^{2}\gamma_{k} for any k1k\geq 1. Then we have

ξ1\displaystyle\xi_{1} 1σ¯min2(LfLAω1+Lf(LAD+1)+LA)t12γ1,\displaystyle\leq\frac{1}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\omega_{1}+L_{f}(L_{A}D+1)+L_{A})t_{1}^{2}\gamma_{1}, (44)
ξk\displaystyle\xi_{k} 2σ¯min2(LfLAωk+4Lf+LA)tk12γk1,k2.\displaystyle\leq\frac{2}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\omega_{k}+4L_{f}+L_{A})t_{k-1}^{2}\gamma_{k-1},\quad\forall k\geq 2. (45)
Proof

By the update rule yk+1=xk+(tk1tk+1)(xkxk1)y_{k+1}=x_{k}+\left(\frac{t_{k}-1}{t_{k+1}}\right)\left(x_{k}-x_{k-1}\right), it holds

Axk+1Ayk+1=ζk+1(1+tk1tk+1)ζk+tk1tk+1ζk1.\displaystyle Ax_{k+1}-Ay_{k+1}=\zeta_{k+1}-\left(1+\frac{t_{k}-1}{t_{k+1}}\right)\zeta_{k}+\frac{t_{k}-1}{t_{k+1}}\zeta_{k-1}.

Thus, we have Axk+1Ayk+14\|Ax_{k+1}-Ay_{k+1}\|\leq 4 due to the fact that γk1\gamma_{k}\leq 1 and 1tktk+11\leq t_{k}\leq t_{k+1}. Combining the KKT condition f(x)+ATλ=0\nabla f(x^{\star})+A^{\mathrm{T}}\lambda^{\star}=0 with the definition of δk+1\delta_{k+1}, we obtain

AT(λλk+1)=f(xk+1)f(x)+Lf(xk+1yk+1)δk+1.A^{\mathrm{T}}(\lambda^{\star}-\lambda_{k+1})=\nabla f(x_{k+1})-\nabla f(x^{\star})+L_{f}(x_{k+1}-y_{k+1})-\delta_{k+1}.

Accordingly, it follows

AAT(λλk+1)=A(f(xk+1)f(x))+Lf(Axk+1Ayk+1)Aδk+1LfLAxk+1x+4Lf+LA,\begin{split}&\|AA^{\mathrm{T}}(\lambda^{\star}-\lambda_{k+1})\|\\ =&~{}\|A(\nabla f(x_{k+1})-\nabla f(x^{\star}))+L_{f}(Ax_{k+1}-Ay_{k+1})-A\delta_{k+1}\|\\ \leq&~{}{L_{f}L_{A}\|x_{k+1}-x^{\star}\|}+4L_{f}+L_{A},\end{split} (46)

where the inequality uses the smoothness of ff and δk1\|\delta_{k}\|\leq 1. Since λλk+1(A)\lambda^{\star}-\lambda_{k+1}\in\mathcal{R}(A), the above inequality actually implies

λλk+11σ¯min2(LfLAxk+1x+4Lf+LA).\|\lambda^{\star}-\lambda_{k+1}\|\leq\frac{1}{\underline{\sigma}_{\min}^{2}}\left({L_{f}L_{A}\|x_{k+1}-x^{\star}\|}+4L_{f}+L_{A}\right).

It follows that for any k1k\geq 1,

ξk+1=|λλk+1,tk+1Auk+1|1σ¯min2(LfLAxk+1x+4Lf+LA)tk2ζktk+12ζk+12σ¯min2(LfLAxk+1x+4Lf+LA)tk2γk,\begin{split}\xi_{k+1}&=|\langle\lambda^{\star}-\lambda_{k+1},t_{k+1}Au_{k+1}\rangle|\\ &\leq\frac{1}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\|x_{k+1}-x^{\star}\|+4L_{f}+L_{A})\|t_{k}^{2}\zeta_{k}-t_{k+1}^{2}\zeta_{k+1}\|\\ &\leq\frac{2}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\|x_{k+1}-x^{\star}\|+4L_{f}+L_{A})t_{k}^{2}\gamma_{k},\end{split} (47)

where the last inequality uses tk+12γk+1tk2γkt_{k+1}^{2}\gamma_{k+1}\leq t_{k}^{2}\gamma_{k}.

Similar to the derivations in (46) and (47), we can also obtain

AAT(λλ1)\displaystyle\|AA^{\mathrm{T}}(\lambda^{\star}-\lambda_{1})\|
=\displaystyle= A(f(x1)f(x))+Lf(Ax1Ay1)Aδ1\displaystyle~{}\|A(\nabla f(x_{1})-\nabla f(x^{\star}))+L_{f}(Ax_{1}-Ay_{1})-A\delta_{1}\|
=\displaystyle= A(f(x1)f(x))+Lf(Ax1b)Lf(Ax0Ax)Aδ1\displaystyle~{}\|A(\nabla f(x_{1})-\nabla f(x^{\star}))+L_{f}(Ax_{1}-b)-L_{f}(Ax_{0}-Ax^{\star})-A\delta_{1}\|
\displaystyle\leq LfLAω1+Lfγ1+LfLAD+LAδ1\displaystyle~{}L_{f}L_{A}\omega_{1}+L_{f}\gamma_{1}+L_{f}L_{A}D+L_{A}\|\delta_{1}\|
\displaystyle\leq LfLAω1+Lf+LfLAD+LA.\displaystyle~{}L_{f}L_{A}\omega_{1}+L_{f}+L_{f}L_{A}D+L_{A}.

Noting that u1=x1xu_{1}=x_{1}-x^{\star}, we have Au1=Ax1bAu_{1}=Ax_{1}-b and

ξ1t12γ1λ1λ1σ¯min2(LfLAω1+Lf(LAD+1)+LA)t12γ1,\xi_{1}\leq t_{1}^{2}\gamma_{1}\|\lambda_{1}-\lambda^{\star}\|\leq\frac{1}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\omega_{1}+L_{f}(L_{A}D+1)+L_{A})t_{1}^{2}\gamma_{1},

which completes the proof. ∎

According to Lemma 12 and Lemma 13, it remains to estimate ωk\omega_{k} in order to obtain the upper bound of aka_{k}. In the discussion that follows, we prove that ωk\omega_{k} can be bounded by a linearly increasing sequence. Utilizing this property, we give the requirements on ϵk\epsilon_{k} and γk\gamma_{k}, and obtain the following result.

Theorem 5.1

Suppose that f(x)f(x) is convex, x0xD\|x_{0}-x^{\star}\|\leq D, the matrix AA satisfies ALA\left\|A\right\|\leq L_{A} and the minimum nonzero singular value of AA is no smaller than σ¯min\underline{\sigma}_{\min}. In Algorithm 4, we set the subproblem error tolerances as ϵkmin{LfD22k2,2Lf}\epsilon_{k}\leq\min\left\{\frac{\sqrt{L_{f}}D}{2\sqrt{2}k^{2}},\sqrt{\frac{2}{L_{f}}}\right\} and γkmin{σ¯min2LfD28(LfLA(ϖ+D)+4Lf+LA),1}1tk2(k+1)3\gamma_{k}\leq\min\left\{\frac{\underline{\sigma}_{\min}^{2}L_{f}D^{2}}{8(L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A})},1\right\}\frac{1}{t_{k}^{2}(k+1)^{3}} where ϖ\varpi is a constant defined in (49). Let {xk}\{x_{k}\} be the iterate sequence generated by Algorithm 4. It holds that

vk16LfD2(k+1)2.v_{k}\leq\frac{16L_{f}D^{2}}{(k+1)^{2}}.
Proof

We can check that the assumptions in Lemma 13 are satisfied: γk1\gamma_{k}\leq 1, δk1tkLf2ϵk1\|\delta_{k}\|\leq\frac{1}{t_{k}}\sqrt{\frac{L_{f}}{2}}\epsilon_{k}\leq 1 and tk+12γk+1tk2γkt_{k+1}^{2}\gamma_{k+1}\leq t_{k}^{2}\gamma_{k} for any k1k\geq 1. We also have ϵ¯k=i=1kϵiLf2D\bar{\epsilon}_{k}=\sum_{i=1}^{k}\epsilon_{i}\leq\sqrt{\frac{L_{f}}{2}}D and i=1kϵi2LfD24\sum_{i=1}^{k}\epsilon_{i}^{2}\leq\frac{L_{f}D^{2}}{4}.

First, we discuss the upper bound of ωk\omega_{k}. By definition,

uk=tk(xkx)(tk1)(xk1x)tkωk(tk1)ωk1,\|u_{k}\|=\|t_{k}(x_{k}-x^{\star})-(t_{k}-1)(x_{k-1}-x^{\star})\|\geq t_{k}\omega_{k}-(t_{k}-1)\omega_{k-1},

where the last inequality holds because tk1t_{k}\geq 1. By (40), it holds bk=Lf2ukτ+sk\sqrt{b_{k}}=\sqrt{\frac{L_{f}}{2}}\|u_{k}\|\leq\sqrt{\tau}+\sqrt{s_{k}}. It follows

tkωk(tk1)ωk12Lf(τ+sk),t_{k}\omega_{k}-(t_{k}-1)\omega_{k-1}\leq\sqrt{\frac{2}{L_{f}}}(\sqrt{\tau}+\sqrt{s_{k}}),

which implies

ωkωk1+2Lf(τ+sk).\omega_{k}\leq\omega_{k-1}+\sqrt{\frac{2}{L_{f}}}(\sqrt{\tau}+\sqrt{s_{k}}).

Substituting (43) into the above inequality and combining (45), we get

ωk\displaystyle\omega_{k} ωk1+2τLf+12Lf(ϵ¯k+(ϵ¯k2+4ξ¯k1+4ξk+4ϵk2+4ϵ¯kτ)1/2)\displaystyle\leq\omega_{k-1}+\sqrt{\frac{2\tau}{L_{f}}}+\sqrt{\frac{1}{2{L_{f}}}}\left(\bar{\epsilon}_{k}+\left(\bar{\epsilon}_{k}^{2}+4\bar{\xi}_{k-1}+4\xi_{k}+4\epsilon_{k}^{2}+4\bar{\epsilon}_{k}\sqrt{\tau}\right)^{1/2}\right)
ωk1+2τLf+12Lfϵ¯k+12Lf(ϵ¯k2+4ξ¯k1\displaystyle\leq\omega_{k-1}+\sqrt{\frac{2\tau}{L_{f}}}+\sqrt{\frac{1}{2{L_{f}}}}\bar{\epsilon}_{k}+\sqrt{\frac{1}{2{L_{f}}}}\left(\bar{\epsilon}_{k}^{2}+4\bar{\xi}_{k-1}\right.
+8LfLAtk12γk1σ¯min2ωk+8(4Lf+LA)tk12γk1σ¯min2+4ϵk2+4ϵ¯kτ)1/2.\displaystyle\left.\qquad\qquad+\frac{8L_{f}L_{A}t_{k-1}^{2}\gamma_{k-1}}{\underline{\sigma}_{\min}^{2}}\omega_{k}+\frac{8(4L_{f}+L_{A})t_{k-1}^{2}\gamma_{k-1}}{\underline{\sigma}_{\min}^{2}}+4\epsilon_{k}^{2}+4\bar{\epsilon}_{k}\sqrt{\tau}\right)^{1/2}.

For simplicity of notation, we denote C0:=2τLf+12Lfϵ¯kC_{0}\mathrel{\mathop{:}}=\sqrt{\frac{2\tau}{L_{f}}}+\sqrt{\frac{1}{2{L_{f}}}}\bar{\epsilon}_{k}, C1:=4LAtk12γk1σ¯min2C_{1}\mathrel{\mathop{:}}=\frac{4L_{A}t_{k-1}^{2}\gamma_{k-1}}{\underline{\sigma}_{\min}^{2}}, and
C2:=12Lf(ϵ¯k2+4ξ¯k1+8(4Lf+LA)tk12γk1σ¯min2+4ϵk2+4ϵ¯kτ)C_{2}\mathrel{\mathop{:}}=\frac{1}{2L_{f}}\left(\bar{\epsilon}_{k}^{2}+4\bar{\xi}_{k-1}+\frac{8(4L_{f}+L_{A})t_{k-1}^{2}\gamma_{k-1}}{\underline{\sigma}_{\min}^{2}}+4\epsilon_{k}^{2}+4\bar{\epsilon}_{k}\sqrt{\tau}\right), so that the above inequality can be rewritten as

ωkωk1+C0+C1(C2C1+ωk)1/2,\omega_{k}\leq\omega_{k-1}+C_{0}+\sqrt{C_{1}}\left(\frac{C_{2}}{C_{1}}+\omega_{k}\right)^{1/2},

which, viewed as a quadratic inequality in (C2/C1+ωk)1/2\left(C_{2}/C_{1}+\omega_{k}\right)^{1/2}, implies

(C2C1+ωk)1/212(C1+C1+4ωk1+4C0+4C2/C1).\left(\frac{C_{2}}{C_{1}}+\omega_{k}\right)^{1/2}\leq\frac{1}{2}\left(\sqrt{C_{1}}+\sqrt{C_{1}+4\omega_{k-1}+4C_{0}+4C_{2}/C_{1}}\right).

Squaring both sides, subtracting C2/C1C_{2}/C_{1}, and then bounding the remaining square root term by term, we have

ωkC12+ωk1+C0+12C12+4C1ωk1+4C0C1+4C2ωk1+C1+C0+C0C1+C1ωk1+C2.\begin{split}\omega_{k}&\leq\frac{C_{1}}{2}+\omega_{k-1}+C_{0}+\frac{1}{2}\sqrt{C_{1}^{2}+4C_{1}\omega_{k-1}+4C_{0}C_{1}+4C_{2}}\\ &\leq\omega_{k-1}+C_{1}+C_{0}+\sqrt{C_{0}C_{1}}+\sqrt{C_{1}\omega_{k-1}}+\sqrt{C_{2}}.\end{split} (48)
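The second inequality in (48) uses the elementary bound \sqrt{a+b}\leq\sqrt{a}+\sqrt{b} applied term by term under the square root,

\sqrt{C_{1}^{2}+4C_{1}\omega_{k-1}+4C_{0}C_{1}+4C_{2}}\leq C_{1}+2\sqrt{C_{1}\omega_{k-1}}+2\sqrt{C_{0}C_{1}}+2\sqrt{C_{2}},

after which the two C1/2C_{1}/2 contributions combine into C1C_{1}.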

On the one hand, we observe that the upper bound of ωk\omega_{k} depends on C0C_{0} and C2C_{2}, which further depend on ξ¯k1\bar{\xi}_{k-1}. On the other hand, from (45), we know ξk\xi_{k} can be bounded by ωk\omega_{k}. Observing this relationship, we can establish upper bounds on both ωk\omega_{k} and ξk\xi_{k} by induction. It is easy to derive that C0C¯0:=3D2C_{0}\leq\bar{C}_{0}\mathrel{\mathop{:}}=\frac{3D}{2}, C1C¯1:=4LAσ¯min2C_{1}\leq\bar{C}_{1}\mathrel{\mathop{:}}=\frac{4L_{A}}{\underline{\sigma}_{\min}^{2}} and C212Lf(3LfD2+4ξ¯k1+8(4Lf+LA)σ¯min2)C_{2}\leq\frac{1}{2L_{f}}\left(3L_{f}D^{2}+4\bar{\xi}_{k-1}+\frac{8(4L_{f}+L_{A})}{\underline{\sigma}_{\min}^{2}}\right). Let C¯2:=12Lf(6LfD2+8(4Lf+LA)σ¯min2)\bar{C}_{2}\mathrel{\mathop{:}}=\frac{1}{2L_{f}}\left(6L_{f}D^{2}+\frac{8(4L_{f}+L_{A})}{\underline{\sigma}_{\min}^{2}}\right), D0=C¯0+C¯1+C¯2+C¯0C¯1,D1=4LAσ¯min2D_{0}=\bar{C}_{0}+\bar{C}_{1}+\sqrt{\bar{C}_{2}}+\sqrt{\bar{C}_{0}\bar{C}_{1}},D_{1}=\frac{4L_{A}}{\underline{\sigma}_{\min}^{2}}. Let ω¯1\bar{\omega}_{1} be the largest root of (51) taken with equality, and define

ϖ=max{(D1+D1+4D0)24,ω¯1}.\varpi=\max\left\{\frac{\left(\sqrt{D_{1}}+\sqrt{D_{1}+4D_{0}}\right)^{2}}{4},\bar{\omega}_{1}\right\}. (49)

Note that C¯0,C¯1,C¯2,D0,D1,ω¯1\bar{C}_{0},\bar{C}_{1},\bar{C}_{2},D_{0},D_{1},\bar{\omega}_{1} and ϖ\varpi depend only on the constants LA,σ¯min,LfL_{A},\underline{\sigma}_{\min},L_{f} and DD. We aim to prove by induction that ωkkϖ\omega_{k}\leq k\varpi and ξkLfD24k2\xi_{k}\leq\frac{L_{f}D^{2}}{4k^{2}} for all k1k\geq 1.

First, we prove the case when k=1k=1. Note that u1=x1x=ω1\|u_{1}\|=\|x_{1}-x^{\star}\|=\omega_{1}. By (40), we have

ω12Lf(τ+s1)\omega_{1}\leq\sqrt{\frac{2}{L_{f}}}(\sqrt{\tau}+\sqrt{s_{1}})

Combining this with (42) yields

ω12Lf(τ+(ϵ12+ξ1+ϵ1(τ+ξ1))1/2).\omega_{1}\leq\sqrt{\frac{2}{L_{f}}}\left(\sqrt{\tau}+\left(\epsilon_{1}^{2}+\xi_{1}+\epsilon_{1}\left(\sqrt{\tau}+\sqrt{\xi_{1}}\right)\right)^{1/2}\right). (50)

Putting (50) and (44) together and using ϵ12Lf\epsilon_{1}\leq\sqrt{\frac{2}{L_{f}}} yields a quartic inequality with respect to ω1\omega_{1}:

ω12Lf[τ+(2Lf+1σ¯min2(LfLAω1+Lf(LAD+1)+LA)+2Lf(τ+1σ¯min2(LfLAω1+Lf(LAD+1)+LA)))1/2].\begin{split}\omega_{1}\leq\sqrt{\frac{2}{L_{f}}}\left[\sqrt{\tau}+\left(\frac{2}{L_{f}}+\frac{1}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\omega_{1}+L_{f}(L_{A}D+1)+L_{A})\right.\right.\\ \left.\left.+\sqrt{\frac{2}{L_{f}}}\left(\sqrt{\tau}+\sqrt{\frac{1}{\underline{\sigma}_{\min}^{2}}(L_{f}L_{A}\omega_{1}+L_{f}(L_{A}D+1)+L_{A})}\right)\right)^{1/2}\right].\end{split} (51)
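Since ω¯1\bar{\omega}_{1} is defined only implicitly as the largest root of (51) taken with equality, the following minimal Python sketch illustrates one way to compute it numerically. The right-hand side of (51) is increasing and concave in ω1\omega_{1}, so the corresponding fixed-point equation has a unique positive root, which bisection locates. The numerical values of LfL_{f}, LAL_{A}, DD and σ¯min\underline{\sigma}_{\min} below are hypothetical placeholders, and τ\tau denotes the constant from the earlier analysis, treated here as a given input.

import math

# Hypothetical placeholder constants; tau is the constant from the earlier analysis.
Lf, LA, D, sigma_min, tau = 1.0, 2.0, 1.0, 0.5, 1.0

def rhs(omega):
    # Right-hand side of (51); increasing and concave in omega.
    q = (Lf * LA * omega + Lf * (LA * D + 1.0) + LA) / sigma_min**2
    inner = 2.0 / Lf + q + math.sqrt(2.0 / Lf) * (math.sqrt(tau) + math.sqrt(q))
    return math.sqrt(2.0 / Lf) * (math.sqrt(tau) + math.sqrt(inner))

# rhs(0) > 0 and rhs grows like sqrt(omega), so rhs(omega) = omega has a unique
# positive root; bracket it from above, then bisect.
lo, hi = 0.0, 1.0
while rhs(hi) > hi:
    hi *= 2.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if rhs(mid) > mid else (lo, mid)
omega_bar_1 = 0.5 * (lo + hi)  # numerical approximation of the largest root of (51)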

Recalling the definitions of ω¯1\bar{\omega}_{1} and ϖ\varpi, we get ω1ω¯1ϖ\omega_{1}\leq\bar{\omega}_{1}\leq\varpi. Thus, it holds that

ξ11σ¯min2(LfLAϖ+Lf(LAD+1)+LA)t12γ1LfD24,\xi_{1}\leq\frac{1}{\underline{\sigma}^{2}_{\min}}(L_{f}L_{A}\varpi+L_{f}(L_{A}D+1)+L_{A})t_{1}^{2}\gamma_{1}\leq\frac{L_{f}D^{2}}{4},

where the last inequality holds because γkσ¯min2LfD28(LfLA(ϖ+D)+4Lf+LA)1tk2(k+1)3\gamma_{k}\leq\frac{\underline{\sigma}_{\min}^{2}L_{f}D^{2}}{8(L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A})}\frac{1}{t_{k}^{2}(k+1)^{3}}.
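In more detail, the choice of γ1\gamma_{1} gives t12γ1σ¯min2LfD2/(64(LfLA(ϖ+D)+4Lf+LA))t_{1}^{2}\gamma_{1}\leq\underline{\sigma}_{\min}^{2}L_{f}D^{2}/\left(64(L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A})\right) since (1+1)3=8(1+1)^{3}=8, and LfLAϖ+Lf(LAD+1)+LALfLA(ϖ+D)+4Lf+LAL_{f}L_{A}\varpi+L_{f}(L_{A}D+1)+L_{A}\leq L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A}, so that

\xi_{1}\leq\frac{L_{f}L_{A}\varpi+L_{f}(L_{A}D+1)+L_{A}}{\underline{\sigma}_{\min}^{2}}\cdot\frac{\underline{\sigma}_{\min}^{2}L_{f}D^{2}}{64\left(L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A}\right)}\leq\frac{L_{f}D^{2}}{64}\leq\frac{L_{f}D^{2}}{4}.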

Suppose that there exists k2k\geq 2 such that ωiiϖ\omega_{i}\leq i\varpi and ξiLfD24i2\xi_{i}\leq\frac{L_{f}D^{2}}{4i^{2}} for any 1ik11\leq i\leq k-1. Then we have i=1k1ξiLfD22\sum_{i=1}^{k-1}\xi_{i}\leq\frac{L_{f}D^{2}}{2} and ξ¯k1=i=1k1ξi+i=1k1ϵi23LfD24\bar{\xi}_{k-1}=\sum_{i=1}^{k-1}\xi_{i}+\sum_{i=1}^{k-1}\epsilon_{i}^{2}\leq\frac{3L_{f}D^{2}}{4}. Hence C2C¯2C_{2}\leq\bar{C}_{2}, and (48) implies

ωk\displaystyle\omega_{k} ωk1+C¯1+C¯0+C¯0C¯1+C1ωk1+C¯2\displaystyle\leq\omega_{k-1}+\bar{C}_{1}+\bar{C}_{0}+\sqrt{\bar{C}_{0}\bar{C}_{1}}+\sqrt{C_{1}\omega_{k-1}}+\sqrt{\bar{C}_{2}}
(k1)ϖ+C¯1+C¯0+C¯0C¯1+(k1)C1ϖ+C¯2\displaystyle\leq(k-1)\varpi+\bar{C}_{1}+\bar{C}_{0}+\sqrt{\bar{C}_{0}\bar{C}_{1}}+\sqrt{(k-1)C_{1}\varpi}+\sqrt{\bar{C}_{2}}
(k1)ϖ+D0+D1ϖ\displaystyle\leq(k-1)\varpi+D_{0}+\sqrt{D_{1}\varpi}
kϖ,\displaystyle\leq k\varpi,

where the third inequality is due to C1ωk14LAσ¯min2tk12γk1(k1)ϖD1ϖC_{1}\omega_{k-1}\leq\frac{4L_{A}}{\underline{\sigma}_{\min}^{2}}t_{k-1}^{2}\gamma_{k-1}\cdot(k-1)\varpi\leq D_{1}\varpi, and the last inequality holds because ϖ\varpi is no smaller than the largest root of the equation D0+D1x=xD_{0}+\sqrt{D_{1}x}=x, so that D0+D1ϖϖD_{0}+\sqrt{D_{1}\varpi}\leq\varpi. Then we can obtain ξkLfD24k2\xi_{k}\leq\frac{L_{f}D^{2}}{4k^{2}} due to (45) and γkσ¯min2LfD28(LfLA(ϖ+D)+4Lf+LA)1tk2(k+1)3\gamma_{k}\leq\frac{\underline{\sigma}_{\min}^{2}L_{f}D^{2}}{8(L_{f}L_{A}(\varpi+D)+4L_{f}+L_{A})}\frac{1}{t_{k}^{2}(k+1)^{3}}. Consequently, by induction we conclude that ωkkϖ\omega_{k}\leq k\varpi and ξkLfD24k2\xi_{k}\leq\frac{L_{f}D^{2}}{4k^{2}} for any k1k\geq 1. It follows that ξ¯kLfD2\bar{\xi}_{k}\leq L_{f}D^{2}. Combining the fact that tk(k+1)/2t_{k}\geq(k+1)/2 with inequality (38) yields

f(xk)f(x)+λ,Axkb(Lf2D+Lf2D)2+2LfD2(k+1)2/416LfD2(k+1)2,f(x_{k})-f(x^{\star})+\langle\lambda^{\star},Ax_{k}-b\rangle\leq\frac{\left(\sqrt{\frac{L_{f}}{2}}D+\sqrt{\frac{L_{f}}{2}}D\right)^{2}+2L_{f}D^{2}}{(k+1)^{2}/4}\leq\frac{16L_{f}D^{2}}{(k+1)^{2}},

which completes the proof. ∎
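To make the parameter choices in Theorem 5.1 concrete, the following minimal Python sketch generates one admissible schedule. It assumes the standard accelerated update t1=1t_{1}=1, tk+1=(1+1+4tk2)/2t_{k+1}=(1+\sqrt{1+4t_{k}^{2}})/2, which satisfies tk(k+1)/2t_{k}\geq(k+1)/2, and treats LfL_{f}, DD, LAL_{A}, σ¯min\underline{\sigma}_{\min} and the constant ϖ\varpi from (49) as given inputs; the numerical values below are hypothetical placeholders.

import math

# Hypothetical placeholder constants; varpi is the constant from (49), assumed precomputed.
Lf, D, LA, sigma_min, varpi = 1.0, 1.0, 2.0, 0.5, 10.0

def schedules(K):
    # Return (t_k, eps_k, gamma_k) for k = 1, ..., K, with eps_k and gamma_k set
    # exactly at the upper bounds allowed by Theorem 5.1.
    c_gamma = min(sigma_min**2 * Lf * D**2 / (8.0 * (Lf * LA * (varpi + D) + 4.0 * Lf + LA)), 1.0)
    t, ts, eps, gam = 1.0, [], [], []
    for k in range(1, K + 1):
        ts.append(t)
        eps.append(min(math.sqrt(Lf) * D / (2.0 * math.sqrt(2.0) * k**2), math.sqrt(2.0 / Lf)))
        gam.append(c_gamma / (t**2 * (k + 1)**3))
        t = (1.0 + math.sqrt(1.0 + 4.0 * t**2)) / 2.0  # assumed accelerated update for t_{k+1}
    return ts, eps, gam

ts, eps, gam = schedules(10)
# Properties used in the proof: t_k >= (k+1)/2 and t_{k+1}^2 gamma_{k+1} <= t_k^2 gamma_k.
assert all(t >= (k + 1) / 2.0 for k, t in enumerate(ts, start=1))
assert all(ts[i + 1]**2 * gam[i + 1] <= ts[i]**2 * gam[i] for i in range(len(ts) - 1))

With γk\gamma_{k} set at its upper bound, tk2γkt_{k}^{2}\gamma_{k} is proportional to 1/(k+1)31/(k+1)^{3} and hence nonincreasing in kk, which is the monotonicity used at the beginning of the proof.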

Corollary 10

Under the same assumptions and the same choices of the parameters ϵk\epsilon_{k} and γk\gamma_{k} as in Theorem 5.1, in order to find an approximate solution xkx_{k} satisfying vk=f(xk)f(x)+λ,Axkbϵv_{k}=f(x_{k})-f(x^{\star})+\langle\lambda^{\star},Ax_{k}-b\rangle\leq\epsilon, the total number of gradient evaluations required by Algorithm 4 is bounded by

O~(κALfD2ϵ).\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right).
Proof

In Theorem 5.1, we have proved that the outer loop complexity is O(LfD2ϵ)O\left(\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right). Let xk=argminAx=b{f(x)+Lf2xyk2}x_{k}^{\star}=\operatorname*{arg\,min}_{Ax=b}\left\{f(x)+\frac{L_{f}}{2}\|x-y_{k}\|^{2}\right\}. The KKT conditions are f(xk)+Lf(xkyk)+ATλk=0\nabla f(x_{k}^{\star})+L_{f}(x_{k}^{\star}-y_{k})+A^{\mathrm{T}}\lambda_{k}^{\star}=0 and Axkb=0Ax_{k}^{\star}-b=0. By the definitions of δk\delta_{k} and ζk\zeta_{k}, we have

δk\displaystyle\delta_{k} =f(xk)f(xk)+Lf(xkxk)+AT(λkλk),\displaystyle=\nabla f(x_{k})-\nabla f(x_{k}^{\star})+L_{f}(x_{k}-x_{k}^{\star})+A^{\mathrm{T}}(\lambda_{k}-\lambda_{k}^{\star}),
ζk\displaystyle\zeta_{k} =A(xkxk).\displaystyle=A(x_{k}-x_{k}^{\star}).

It follows that δk2Lfxkxk+LAλkλk\|\delta_{k}\|\leq 2L_{f}\|x_{k}-x_{k}^{\star}\|+L_{A}\|\lambda_{k}-\lambda_{k}^{\star}\| and ζkLAxkxk\|\zeta_{k}\|\leq L_{A}\|x_{k}-x_{k}^{\star}\|. Let ϵ~k=1LA+2Lfmin{Lf2ϵktk,γk}\tilde{\epsilon}_{k}=\frac{1}{L_{A}+2L_{f}}\cdot\min\left\{\sqrt{\frac{L_{f}}{2}}\cdot\frac{\epsilon_{k}}{t_{k}},\gamma_{k}\right\}. If the subroutine outputs a pair (xk,λk)(x_{k},\lambda_{k}) such that xkxkϵ~k\|x_{k}-x_{k}^{\star}\|\leq\tilde{\epsilon}_{k} and λkλkϵ~k\|\lambda_{k}-\lambda_{k}^{\star}\|\leq\tilde{\epsilon}_{k}, then the required subproblem error tolerances are satisfied. The condition number of the objective function of the subproblem (33) is O(1)O(1). Since ϵ~k\tilde{\epsilon}_{k} depends polynomially on ϵ\epsilon and the other parameters, we obtain by Corollary 1 that the number of gradient evaluations in each inner loop is O~(κA)\tilde{O}(\kappa_{A}). Consequently, the overall complexity is O~(κALfD2ϵ)\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right). ∎
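Concretely, the gradient-evaluation count in the proof above accumulates as follows: with T=O(LfD2/ϵ)T=O\left(\sqrt{L_{f}D^{2}/\epsilon}\right) outer iterations and, assuming (consistently with the use of Corollary 1 above) that each inner loop requires O(κAlog(1/ϵ~k))O\left(\kappa_{A}\log(1/\tilde{\epsilon}_{k})\right) gradient evaluations,

\sum_{k=1}^{T}O\left(\kappa_{A}\log\frac{1}{\tilde{\epsilon}_{k}}\right)=O\left(\kappa_{A}T\log\frac{1}{\epsilon}\right)=\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right),

where the logarithmic dependence on the remaining problem constants is absorbed into the O~\tilde{O} notation.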

Remark 4

Note that Algorithm 4 outputs an approximate solution of the last subproblem, which satisfies AxTbγT\|Ax_{T}-b\|\leq\gamma_{T}. If we further require that γTmin{ϵ2λ,ϵ}\gamma_{T}\leq\min\left\{\frac{\epsilon}{2\|\lambda^{\star}\|},\epsilon\right\}, the overall complexity remains unchanged up to logarithmic factors since the inner loop converges linearly. Therefore, in order to obtain an approximate solution xTx_{T} satisfying f(xT)f(x)+λ,AxTbϵ2f(x_{T})-f(x^{\star})+\langle\lambda^{\star},Ax_{T}-b\rangle\leq\frac{\epsilon}{2} and AxTbγT\|Ax_{T}-b\|\leq\gamma_{T}, the complexity upper bound is still O~(κALfD2ϵ)\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right). In this case, we can conclude that f(xT)f(x)ϵf(x_{T})-f(x^{\star})\leq\epsilon because γTϵ2λ\gamma_{T}\leq\frac{\epsilon}{2\|\lambda^{\star}\|}, as spelled out below. Hence, the above result actually implies a complexity upper bound of O~(κALfD2ϵ)\tilde{O}\left(\kappa_{A}\sqrt{\frac{L_{f}D^{2}}{\epsilon}}\right) for finding an approximate solution xTx_{T} satisfying f(xT)f(x)ϵf(x_{T})-f(x^{\star})\leq\epsilon and AxTbϵ\|Ax_{T}-b\|\leq\epsilon.
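Indeed, by the Cauchy–Schwarz inequality,

f(x_{T})-f(x^{\star})=\left[f(x_{T})-f(x^{\star})+\langle\lambda^{\star},Ax_{T}-b\rangle\right]-\langle\lambda^{\star},Ax_{T}-b\rangle\leq\frac{\epsilon}{2}+\|\lambda^{\star}\|\|Ax_{T}-b\|\leq\frac{\epsilon}{2}+\|\lambda^{\star}\|\gamma_{T}\leq\epsilon.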

6 Conclusions

In this work, we analyze the lower and upper complexity bounds for composite optimization problems in the strongly convex, convex, and non-convex scenarios. Different from most previous studies, we specifically consider problem classes with a given condition number κA\kappa_{A}. Our results demonstrate that the presented complexities are optimal up to logarithmic factors. This study provides the first optimal algorithms for the convex and non-convex cases, as well as the first set of lower bounds for all three cases. In future work, it remains interesting to investigate how algorithms can be designed to further tighten the logarithmic factors in the complexity upper bounds.

Funding This work was supported by the National Natural Science Foundation of China under grant number 11831002.

Data Availability

Declarations

Conflict of interest The authors have no relevant financial interest to disclose.

References

  • (1) Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
  • (2) Bertsekas, D.: Dynamic programming and optimal control: Volume I, vol. 1. Athena Scientific (2012)
  • (3) Cai, X., Han, D., Yuan, X.: On the convergence of the direct extension of ADMM for three-block separable convex minimization models with one strongly convex function. Computational Optimization and Applications 66, 39–73 (2017)
  • (4) Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Mathematical Programming 184(1), 71–120 (2020)
  • (5) Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40, 120–145 (2011)
  • (6) Chang, T.H., Hong, M., Liao, W.C., Wang, X.: Asynchronous distributed ADMM for large-scale optimization—part I: Algorithm and convergence analysis. IEEE Transactions on Signal Processing 64(12), 3118–3130 (2016)
  • (7) Feijer, D., Paganini, F.: Stability of primal–dual gradient dynamics and applications to network optimization. Automatica 46(12), 1974–1981 (2010)
  • (8) Hamedani, E.Y., Aybat, N.S.: A primal-dual algorithm with line search for general convex-concave saddle point problems. SIAM Journal on Optimization 31(2), 1299–1329 (2021)
  • (9) Hiriart-Urruty, J., Lemarechal, C.: Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg (1996). URL https://books.google.com.sg/books?id=aSizI0n6tnsC
  • (10) Hong, M., Hajinezhad, D., Zhao, M.M.: Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In: International Conference on Machine Learning, pp. 1529–1538. PMLR (2017)
  • (11) Jiang, K., Sun, D., Toh, K.C.: An inexact accelerated proximal gradient method for large scale linearly constrained convex SDP. SIAM Journal on Optimization 22(3), 1042–1064 (2012)
  • (12) Kong, W., Melo, J.G., Monteiro, R.D.: Complexity of a quadratic penalty accelerated inexact proximal point method for solving linearly constrained nonconvex composite programs. SIAM Journal on Optimization 29(4), 2566–2593 (2019)
  • (13) Kong, W., Melo, J.G., Monteiro, R.D.: Iteration complexity of an inner accelerated inexact proximal augmented Lagrangian method based on the classical Lagrangian function. SIAM Journal on Optimization 33(1), 181–210 (2023)
  • (14) Lin, T., Ma, S., Zhang, S.: On the global linear convergence of the ADMM with multiblock variables. SIAM Journal on Optimization 25(3), 1478–1497 (2015)
  • (15) Lin, T., Ma, S., Zhang, S.: On the sublinear convergence rate of multi-block ADMM. Journal of the Operations Research Society of China 3, 251–274 (2015)
  • (16) Mokhtari, A., Ozdaglar, A.E., Pattathil, S.: Convergence rate of 𝒪(1/k)\mathcal{O}(1/k) for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems. SIAM Journal on Optimization 30(4), 3230–3251 (2020)
  • (17) Nesterov, Y., et al.: Lectures on convex optimization, vol. 137. Springer (2018)
  • (18) Noschese, S., Pasquini, L., Reichel, L.: Tridiagonal Toeplitz matrices: properties and novel applications. Numerical Linear Algebra with Applications 20(2), 302–326 (2013)
  • (19) Ouyang, Y., Chen, Y., Lan, G., Pasiliao Jr, E.: An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences 8(1), 644–681 (2015)
  • (20) Ouyang, Y., Xu, Y.: Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming 185(1-2), 1–35 (2021)
  • (21) Salim, A., Condat, L., Kovalev, D., Richtárik, P.: An optimal algorithm for strongly convex minimization under affine constraints. In: International Conference on Artificial Intelligence and Statistics, pp. 4482–4498. PMLR (2022)
  • (22) Shi, W., Ling, Q., Wu, G., Yin, W.: Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization 25(2), 944–966 (2015)
  • (23) Showalter, R.: Monotone operators in Banach space and nonlinear partial differential equations. Math. Surv. Mono. 49 (1997)
  • (24) Song, C., Jiang, Y., Ma, Y.: Breaking the O(1/ϵ){O}(1/\epsilon) optimal rate for a class of minimax problems. arXiv preprint arXiv:2003.11758 (2020)
  • (25) Sun, H., Hong, M.: Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms. IEEE Transactions on Signal Processing 67(22), 5912–5928 (2019)
  • (26) Xu, Y.: Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization 27(3), 1459–1484 (2017)
  • (27) Xu, Y.: First-order methods for constrained convex programming based on linearized augmented Lagrangian function. INFORMS Journal on Optimization 3(1), 89–117 (2021)
  • (28) Xu, Y.: Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming. Mathematical Programming 185, 199–244 (2021)
  • (29) Xu, Y.: First-order methods for problems with o(1)o(1) functional constraints can have almost the same convergence rate as for unconstrained problems. SIAM Journal on Optimization 32(3), 1759–1790 (2022)
  • (30) Yang, J., Zhang, Y.: Alternating direction algorithms for ℓ1\ell_{1}-problems in compressive sensing. SIAM Journal on Scientific Computing 33(1), 250–278 (2011)
  • (31) Zhang, J., Luo, Z.Q.: A proximal alternating direction method of multiplier for linearly constrained nonconvex minimization. SIAM Journal on Optimization 30(3), 2272–2302 (2020)
  • (32) Zhang, J., Luo, Z.Q.: A global dual error bound and its application to the analysis of linearly constrained nonconvex optimization. SIAM Journal on Optimization 32(3), 2319–2346 (2022)
  • (33) Zhu, M., Chan, T.: An efficient primal-dual hybrid gradient algorithm for total variation image restoration. Ucla Cam Report 34, 8–34 (2008)
  • (34) Zhu, Z., Chen, F., Zhang, J., Wen, Z.: A unified primal-dual algorithm framework for inequality constrained problems. arXiv preprint arXiv:2208.14196 (2022)