
Jiachen Jin ([email protected]) · Hongxia Wang, corresponding author ([email protected]) · Kangkang Deng ([email protected])

College of Science, National University of Defense Technology, Changsha, 410073, Hunan, China.

Anderson acceleration of derivative-free projection methods for constrained monotone nonlinear equations

Jiachen Jin    Hongxia Wang    Kangkang Deng
(Received: date / Accepted: date)
Abstract

The derivative-free projection method (DFPM) is an efficient algorithm for solving monotone nonlinear equations. As problems grow larger, there is a strong demand for speeding up the convergence of DFPM. This paper considers the application of Anderson acceleration (AA) to DFPM for constrained monotone nonlinear equations. By employing a nonstationary relaxation parameter and introducing slight modifications in each iteration, we propose a globally convergent variant of AA for DFPM, named AA-DFPM. Further, a linear convergence rate is proved under some mild assumptions. Experiments on both mathematical examples and a real-world application show encouraging results for AA-DFPM and confirm the suitability of AA for accelerating DFPM in solving optimization problems.

Keywords:
Anderson acceleration · Derivative-free projection method · Monotone nonlinear equations · Convergence
MSC:
65K05 · 90C56 · 68U10

1 Introduction

In this paper, we focus on solving the following monotone nonlinear equations with convex constraint:

F(x)=0,\quad x\in\mathcal{C},   (1)

where F:\mathbb{R}^{n}\to\mathbb{R}^{n} is a continuous and monotone mapping, and \mathcal{C}\subset\mathbb{R}^{n} is a closed convex set. The monotonicity of the mapping F means that

(F(x)-F(y))^{\top}(x-y)\geq 0,\quad\forall\ x,y\in\mathbb{R}^{n}.

Systems of nonlinear equations have numerous applications, such as chemical equilibrium systems MM87 , split feasibility problems QX05 and neural networks CZ14 . Moreover, some concrete real-world models are monotone. For instance, compressed sensing can first be formulated as a convex quadratic program and then as an equivalent system of monotone nonlinear equations XWH11 . Regularized decentralized logistic regression can also be expressed as monotone nonlinear equations JYTH22 . As observation techniques advance, the size of observed data expands and the required resolution of results increases, so the scale of the nonlinear equations enlarges accordingly.

Various iterative methods for solving (1) include the Newton method SWQ02 , the trust-region algorithm QTL04 , the Levenberg-Marquardt method YJM23 , etc. Although these methods perform well theoretically and numerically, they have difficulty dealing with large-scale equations due to the computation of the Jacobian matrix or its approximation. In contrast, on the basis of the hyperplane projection technique for monotone equations SS98 and first-order optimization methods for unconstrained optimization, many derivative-free projection methods (DFPM) have sprung up [10-20] for convex-constrained monotone nonlinear equations, whose computational cost in each iteration consists only of function evaluations.

The search direction and the line search procedure are crucial for DFPM, and different constructions of them correspond to different variants of DFPM. Benefiting from the simple structure and low storage requirements of conjugate gradient methods (CGM), the conjugate gradient projection methods (CGPM), whose search directions follow the designs used in CGM, provide a class of competitive algorithms, for instance, CGPM LF19 ; LLXWT22 , spectral CGPM IKRPA22 ; WSLZ23 , and three-term CGPM YJJLW21 ; WA22 ; MJJYH23 . Meanwhile, different choices of the line search procedure may yield different convergence properties. Several line search procedures have been proposed for DFPM in solving constrained monotone nonlinear equations (see YJJLW21 ; ZZ06 ; OL18 ; AK15 ; LL11 for instance). Although there have been many studies on DFPM for solving problem (1), almost all of them focus on specific algorithms. Only a few papers have partially discussed unified treatments of this class of methods (see OL23 ; GM23 ), which motivates us to center on a general framework of DFPM and its convergence analysis.

In order to construct more efficient numerical algorithms, a promising strategy that has recently emerged in a number of fields is to embed acceleration techniques in the underlying algorithms. Anderson acceleration (AA) was originally designed for integral equations Anderson65 and is now a very popular acceleration method for fixed-point schemes. AA can be viewed as an extension of momentum methods, such as inertial acceleration Alvarez00 and Nesterov acceleration Nesterov83 . The idea differs from theirs in maintaining information from several previous steps rather than just the last two iterates, and in updating the iterate as a linear combination of this information with dynamic weights. Some studies have explored the connection between AA and other classical methods, which also facilitates the understanding of AA. For linear problems, Walker and Ni WN11 showed that AA is related to the well-known generalized minimal residual algorithm (GMRES SS86 ). Potra and Engler PE13 demonstrated the equivalence between GMRES and AA with any mixing parameters under full memory (i.e., m=\infty in Algorithm 2). For nonlinear cases, AA is also closely related to the nonlinear GMRES WHS21 . Fang and Saad FS09 identified the relationship between AA and multi-secant quasi-Newton methods.

Although AA often exhibits superior numerical performance in speeding up fixed-point computations with countless applications, such as reinforcement learning WCHZZQ23 , numerical methods for PDEs SRX19 and seismic inversion Yang21 , it is known to converge only locally in theory TK15 ; MJ20 . The convergence analysis of most, if not all, existing methods requires additional assumptions. A more recent result by Bian and Chen BC22 proved that AA with m=1 is q-linearly convergent with a smaller q-factor than existing q-factors for the composite max fixed-point problem, where the functions are Lipschitz continuously differentiable. Rebholz and Xiao RX23 investigated the effect of AA on the superlinear and sublinear convergence of various fixed-point iterations, with the operator satisfying certain properties. Garner et al. GLZ23 proved that AA improves the root-linear convergence factor over the fixed-point iteration when the operator is linear and symmetric, or is nonlinear but has a symmetric Jacobian at the solution.

The efficiency of AA in a wide range of applications further motivates us to apply this technique to DFPM for solving problem (1). Our main goal in this paper is hence to provide a globally convergent AA variant of general DFPM without any assumptions beyond monotonicity. The work can be seen as an extension of the recent inertial DFPM in IKRPA22 ; WSLZ23 ; MJJYH23 , since it utilizes more information than just the last two iterates. The main contributions of this paper are outlined below:

\bullet An accelerated version of DFPM combined with AA (AA-DFPM) is proposed to solve convex-constrained monotone nonlinear equations. Fully exploiting the structure of the optimization problem, several modifications to the acceleration scheme are considered. To the best of our knowledge, this is the first application of AA to DFPM.

\bullet A self-contained proof for the global convergence of AA-DFPM is given with no additional assumptions apart from monotonicity on the nonlinear mapping. We further discuss the convergence rate under some standard assumptions.

\bullet The numerical experiments on large-scale constrained nonlinear equations and decentralized logistic regression demonstrate that AA-DFPM outperforms the corresponding DFPM in terms of efficiency and robustness.

The paper is organized as follows. In Section 2, we start by outlining the unified algorithmic framework of DFPM and its convergence results. Based on DFPM with the convergence gained, in Section 3, we further introduce AA and extend the acceleration technique to DFPM. The convergence analysis of AA-DFPM is established under some mild assumptions in Section 4. We report the numerical results of AA-DFPM on large-scale constrained nonlinear equations and a machine learning problem in Section 5. The conclusion is given in Section 6.

Notation. Throughout the paper, we denote by \|\cdot\| the Euclidean norm on \mathbb{R}^{n} and write F_{k}:=F(x_{k}). For a closed convex set \mathcal{C}, {\rm dist}(x_{k},\mathcal{C}) denotes the distance from an iterate x_{k} to \mathcal{C}, and the projection operator is P_{\mathcal{C}}[x]={\rm argmin}\{\|z-x\|\ |\ z\in\mathcal{C}\}. Furthermore, it has the nonexpansive property:

\|P_{\mathcal{C}}[x]-P_{\mathcal{C}}[y]\|\leq\|x-y\|,\quad\forall\ x,y\in\mathbb{R}^{n}.

2 Derivative-free projection method

In this section, we review a general framework of DFPM and recall its theoretical results. Throughout the paper, we assume that the solution set \mathcal{S} of problem (1) is nonempty.

2.1 General framework of DFPM

The core of DFPM is the hyperplane projection technique SS98 . It projects the current iterate onto a hyperplane constructed from the monotonicity of the mapping, which effectively separates the current iterate from the solution set. In general, for a given current iterate x_{k}, a search direction d_{k} is computed first, and then a stepsize \alpha_{k} is determined by a line search so as to satisfy

F(z_{k})^{\top}(x_{k}-z_{k})>0,

where z_{k}=x_{k}+\alpha_{k}d_{k}. By the monotonicity of F, we have

F(z_{k})^{\top}(x^{*}-z_{k})=(F(z_{k})-F(x^{*}))^{\top}(x^{*}-z_{k})\leq 0,\quad\forall\ x^{*}\in\mathcal{S}.

Thus the hyperplane

H_{k}:=\{x\in\mathbb{R}^{n}\ |\ F(z_{k})^{\top}(x-z_{k})=0\}

strictly separates the current iterate x_{k} from any solution x^{*}. The next iterate is obtained by projecting x_{k} first onto the separating hyperplane H_{k} and then onto the feasible set \mathcal{C}, i.e., x_{k+1}=P_{\mathcal{C}}\left[P_{H_{k}}\left[x_{k}\right]\right]. Separation arguments show that the distance from x_{k} to the solution set decreases monotonically as k increases, which essentially ensures the global convergence of DFPM.
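Concretely, since H_{k} is a hyperplane with normal vector F(z_{k}), the projection of x_{k} onto H_{k} admits the closed form

P_{H_{k}}[x_{k}]=x_{k}-\frac{F(z_{k})^{\top}(x_{k}-z_{k})}{\|F(z_{k})\|^{2}}F(z_{k}),

which, up to the relaxation factor \zeta, is exactly the update (5) used in Step 4 of the framework below.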

From the above process, the determination of d_{k} and \alpha_{k} plays a crucial role in DFPM. Different choices of the direction or the stepsize lead to different variants of DFPM. Competitive instances include the conjugate gradient projection method LF19 , the spectral conjugate gradient projection method IKRPA22 , and the three-term conjugate gradient projection method WA22 . We concentrate on the unified framework for DFPM in Algorithm 1.

Input: initial point x_{0}\in\mathcal{C}, parameters \gamma,\ \sigma,\ \epsilon>0, 0<s_{1}\leq s_{2}, \rho\in(0,1), 0\leq t_{1}\ll t_{2}, \zeta\in(0,2). Set k:=0.
Step 1. Compute F_{k}. If \|F_{k}\|<\epsilon, then stop; otherwise, go to Step 2.
Step 2. Compute the search direction d_{k} such that
F_{k}^{\top}d_{k}\leq-s_{1}\|F_{k}\|^{2},   (2)
\|d_{k}\|\leq s_{2}\|F_{k}\|.   (3)
Step 3. Set z_{k}=x_{k}+\alpha_{k}d_{k}, where \alpha_{k}=\gamma\rho^{i_{k}} with i_{k} being the smallest nonnegative integer i such that
-F(x_{k}+\gamma\rho^{i}d_{k})^{\top}d_{k}\geq\sigma\gamma\rho^{i}P_{[t_{1},t_{2}]}[\|F(x_{k}+\gamma\rho^{i}d_{k})\|]\|d_{k}\|^{2}.   (4)
Step 4. Yield the next iterate by
x_{k+1}=P_{\mathcal{C}}\left[x_{k}-\zeta u_{k}F(z_{k})\right],   (5)
where u_{k}=\frac{F(z_{k})^{\top}(x_{k}-z_{k})}{\|F(z_{k})\|^{2}}. Let k:=k+1, and go to Step 1.
Output: x_{k}
Algorithm 1 General framework of DFPM

Algorithm 1 is a special case of Algorithm UAF in OL23 that adopts the line search scheme VI. We focus on this scenario since it is representative of DFPM. Several general characteristics of the framework are analyzed as follows.

Search direction d_{k}. Conditions (2) and (3) on d_{k} guarantee the global convergence. If F is the gradient of a function f:\mathbb{R}^{n}\to\mathbb{R}, then (2) indicates that d_{k} is a sufficient descent direction for f at x_{k}. Further, condition (2) implies that the line search procedure (4) is well-defined. Condition (3) means that the magnitude of d_{k} should be proportional to that of F_{k}. The way to obtain a d_{k} satisfying (2) and (3) depends on the particular instance of the framework. For example, the directions in IKRPA22 ; WSLZ23 ; WA22 ; MJJYH23 ; AK15 all satisfy these conditions. Three specific examples are presented in Section 5.

Line search procedure. Note that \eta_{k}(i):=P_{[t_{1},t_{2}]}[\|F(x_{k}+\gamma\rho^{i}d_{k})\|],\ 0\leq t_{1}\ll t_{2}, on the right-hand side of (4) can be replaced by other choices, for instance \eta_{k}(i)=\lambda_{k}+(1-\lambda_{k})\|x_{k}+\gamma\rho^{i}d_{k}\|,\ \lambda_{k}\in(0,1] in OL18 . Here we only focus on this case since it is an adaptive line search procedure recently proposed by Yin et al. YJJLW21 and is widely used to compute a stepsize WSLZ23 ; MJJYH23 . More specifically, if t_{1}=t_{2}=1, then \eta_{k}(i)=1, and the procedure reduces to the one in ZZ06 ; if t_{1}=0 and t_{2} is large enough, then \eta_{k}(i)=\|F(x_{k}+\gamma\rho^{i}d_{k})\|, and it reduces to the procedure in LL11 . The projection onto [t_{1},t_{2}] in (4) prevents the right-hand side of (4) from being too small or too large, which effectively reduces the computational cost of Step 3.

Projection strategy. The relaxation factor \zeta\in(0,2) in (5) serves as a parameter that can enhance the convergence, as stated in CGH14 . When \zeta=1, it corresponds to the original strategy presented in SS98 .
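For concreteness, the following is a minimal MATLAB sketch of one iteration of Algorithm 1, assuming \mathcal{C}=\mathbb{R}_{+}^{n} (so the projection is a componentwise maximum with zero); the routine name direction is a hypothetical placeholder for any rule returning a d_{k} that satisfies (2) and (3).

% One iteration of Algorithm 1 (illustrative sketch, C = R_+^n).
% F: function handle for the mapping; xk: current iterate in C.
Fk = F(xk);
dk = direction(xk, Fk);                          % any d_k satisfying (2)-(3)
alpha = gamma;                                   % backtracking line search (4)
Ftrial = F(xk + alpha*dk);
while -Ftrial'*dk < sigma*alpha*min(max(norm(Ftrial), t1), t2)*norm(dk)^2
    alpha = rho*alpha;
    Ftrial = F(xk + alpha*dk);
end
zk = xk + alpha*dk;  Fz = Ftrial;
uk = Fz'*(xk - zk)/norm(Fz)^2;                   % hyperplane projection coefficient
xk = max(xk - zeta*uk*Fz, 0);                    % relaxed projection step (5)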

2.2 Global convergence

We present two simple results to show the global convergence of Algorithm 1. Based on (2) and (3), the proofs are similar to those in the corresponding literature, so we omit them here.

Lemma 1

(YJJLW21 , Lemma 4) Suppose the sequences \{x_{k}\} and \{z_{k}\} are generated by Algorithm 1. Then the following two claims hold.
(i) For any x^{*}\in\mathcal{S}, \{\|x_{k}-x^{*}\|\} is convergent.
(ii) \{x_{k}\}, \{d_{k}\} and \{z_{k}\} are all bounded, and \lim\limits_{k\to\infty}\alpha_{k}\|d_{k}\|=0.

Theorem 2.1

(OL23 , Theorem 3.6) Let the sequence \{x_{k}\} be generated by Algorithm 1. Then \{x_{k}\} converges to a solution of problem (1).

3 Anderson acceleration for DFPM

Having established the convergence of the underlying algorithm, we proceed to show how Anderson acceleration (AA) may translate into improved convergence behavior for DFPM.

3.1 Anderson acceleration

AA is an efficient acceleration method for the fixed-point iteration x_{k+1}=G(x_{k}). The key idea of AA is to form a new extrapolation point using the past iterates. To generate a better estimate x_{k+1}, it searches for a point \bar{x}_{k} with the smallest residual among affine combinations of the m+1 most recent iterates. Writing \bar{x}_{k}=\sum_{j=0}^{m}a^{k}_{j}x_{j},\ m\leq k, with \sum_{j=0}^{m}a^{k}_{j}=1, AA seeks a vector of coefficients a^{k}=(a^{k}_{0},\dots,a^{k}_{m})^{\top} such that

a^{k}=\arg\min\|G(\bar{x}_{k})-\bar{x}_{k}\|.

However, it is hard to find a^{k} for a general nonlinear mapping G. AA instead uses the approximation

G(\bar{x}_{k})=G\left(\sum\limits_{j=0}^{m}a^{k}_{j}x_{j}\right)\doteq\sum\limits_{j=0}^{m}a^{k}_{j}G(x_{j}),

where

a^{k}=\arg\min\left\|\sum\limits_{j=0}^{m}a^{k}_{j}G(x_{j})-\sum\limits_{j=0}^{m}a^{k}_{j}x_{j}\right\|=\arg\min\left\|\sum\limits_{j=0}^{m}a^{k}_{j}r_{j}\right\|

with r_{j}=G(x_{j})-x_{j}. Once a^{k} is computed, the next iterate of AA is generated by the following mixing

x_{k+1}=(1-b_{k})\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}x_{j}+b_{k}\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}G(x_{j}).

A formal algorithmic description of AA with window length m_{k} is given in Algorithm 2.

Input: initial point x_{0}, parameters m>0, b_{k}\in(0,1].
x_{1}=G(x_{0});
for k=1,\dots,K-1 do
       m_{k}=\min\{m,k\}, r_{k}=G(x_{k})-x_{k};
       R_{k}=(r_{k-m_{k}},\dots,r_{k})^{\top};
       Solve \min\limits_{a^{k}=(a^{k}_{k-m_{k}},\dots,a^{k}_{k})^{\top}}\left\|R_{k}^{\top}a^{k}\right\|^{2}, subject to \sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}=1;
       x_{k+1}=(1-b_{k})\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}x_{j}+b_{k}\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}G(x_{j}).
end for
return x_{K}
Algorithm 2 Anderson acceleration (AA)

In each iteration of Algorithm 2, AA incorporates useful information from the previous m_{k} iterates through an affine combination, where the coefficient vector a^{k} is computed as the solution of a least squares problem, rather than relying on the current iterate alone. The window size m indicates how many history iterates are used in the algorithm, and its value is typically no larger than 10 in practice. If m=0, AA reduces to the fixed-point iteration. When b_{k} is a constant independent of k, Algorithm 2 is referred to as stationary AA. Many works (see WN11 ; TK15 ; MJ20 ; BC22 ) take b_{k}\equiv 1 to simplify the analysis. Here we consider a nonstationary case, and the expression of b_{k} is given in (9).
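For reference, the following is a minimal MATLAB sketch of the inner update of Algorithm 2 with b_{k}\equiv 1, assuming the windowed iterates x_{j} and values G(x_{j}) are stored as the columns of matrices X and GX (names are illustrative). The equality constraint is eliminated by writing the last coefficient as one minus the sum of the others, a standard reformulation that also appears in Section 5.2.

% X, GX: n-by-(mk+1) matrices with columns x_j and G(x_j), j = k-mk,...,k.
R  = GX - X;                       % residual columns r_j = G(x_j) - x_j
dR = R(:,1:end-1) - R(:,end);      % differences r_j - r_k, j < k
th = -(dR \ R(:,end));             % unconstrained least-squares coefficients
a  = [th; 1 - sum(th)];            % recover a^k with sum(a) = 1
xnext = GX*a;                      % mixing step with b_k = 1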

3.2 Acceleration algorithm

Based on the convergence result in Section 2, we incorporate AA into DFPM and give the resulting algorithm, named AA-DFPM, in Algorithm 3. Note that DFPM may not be a fixed-point iteration in x_{k}, since the direction d_{k} may involve other quantities. However, since AA is a sequence acceleration technique, we expect DFPM to gain a speedup as long as it is convergent.

Input: initial point x_{0}\in\mathcal{C}, parameters m,\ c,\ \gamma,\ \sigma,\ \epsilon>0, 0<s_{1}\leq s_{2}, \rho\in(0,1), 0\leq t_{1}\ll t_{2}, \zeta\in(0,2), b_{k}\in(0,1]. Set k:=0.
Step 1. Compute F_{k}. If \|F_{k}\|<\epsilon, then stop; otherwise, go to Step 2.
Step 2. Compute the search direction d_{k} satisfying (2) and (3).
Step 3. Choose the stepsize \alpha_{k} satisfying (4), and set z_{k}=x_{k}+\alpha_{k}d_{k}.
Step 4. Calculate
v_{k}=P_{\mathcal{C}}\left[x_{k}-\zeta u_{k}F(z_{k})\right],
where u_{k}=\frac{F(z_{k})^{\top}(x_{k}-z_{k})}{\|F(z_{k})\|^{2}}. If \|F(v_{k})\|<\epsilon, stop; otherwise, go to Step 5.
Step 5. Anderson acceleration for k\neq 0: set m_{k}=\min\{m,k\}, r_{k}=v_{k}-x_{k}. Let a^{k}=(a^{k}_{k-m_{k}},\dots,a^{k}_{k})^{\top}, R_{k}=(r_{k-m_{k}},\dots,r_{k})^{\top}, and solve
\min\limits_{a^{k}}\left\|R_{k}^{\top}a^{k}\right\|^{2},\quad{\rm subject\ to}\ \sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}=1,\ a^{k}_{j}\geq 0,\ j=k-m_{k},\dots,k.
x_{k}^{AA}=(1-b_{k})\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}x_{j}+b_{k}\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}v_{j}.   (6)
If \left\|\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}x_{j}-v_{k}\right\|\leq ck^{-(1+\epsilon)}, then x_{k+1}=x_{k}^{AA}; else x_{k+1}=v_{k}.
Let k:=k+1, and go to Step 1.
Output: x_{k}
Algorithm 3 AA-DFPM for (1)

Some implementation techniques in the algorithm bear further comment. We thus introduce and discuss the following four aspects.

Feasibility of the accelerated iterate. As illustrated in Algorithm 2, AA computes the accelerated iterate via an affine combination of previous iterates, so the accelerated point may violate the constraint unless the feasible set is affine. Considering that the feasible set \mathcal{C} in problem (1) is closed and convex and that the previous iterates x_{k} generated by the algorithm all lie in \mathcal{C}, we impose a^{k}_{j}\geq 0,\ j=k-m_{k},\dots,k, in Algorithm 3 (line 3 of Step 5) to obtain a feasible accelerated iterate. This means that the accelerated iterate here is computed by a convex combination of previous iterates.

Computation of the coefficient a^{k}. The residual matrix R_{k} in the least squares problem of AA can be rank-deficient, so ill-conditioning may occur when computing a^{k}. Here a Tikhonov regularization SDB16 term \lambda\|a^{k}\|^{2},\ \lambda>0, is added to the problem to obtain a reliable a^{k}. In this case, combined with the above implementation, the least squares problem becomes a standard quadratic programming problem that can be solved by the MATLAB command "quadprog".
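A possible MATLAB realization of this regularized subproblem is sketched below, assuming Rk is an (m_{k}+1)-by-n matrix whose rows are the residuals r_{j} and lambda is the regularization weight; variable names are illustrative.

% min_a  a'*(Rk*Rk' + lambda*I)*a   s.t.  sum(a) = 1,  a >= 0
p   = size(Rk, 1);                        % p = m_k + 1
H   = 2*(Rk*Rk' + lambda*eye(p));         % quadprog minimizes 0.5*a'*H*a + f'*a
f   = zeros(p, 1);
Aeq = ones(1, p);  beq = 1;               % affine-combination constraint
lb  = zeros(p, 1);                        % nonnegativity keeps the iterate in C
ak  = quadprog(H, f, [], [], Aeq, beq, lb, []);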

Guarantee of convergence. As mentioned above, since AA is known to converge only locally, some globalization mechanism is required in practice, such as adaptive regularization OTMD23 , restart checking HV19 and safeguarding steps OPYZD20 . Following ZOB20 , we introduce a safeguard check (line 5 of Step 5 in Algorithm 3) to ensure the global convergence, so that

\sum_{k=1}^{\infty}\left\|\sum_{j=k-m_{k}}^{k}a^{k}_{j}x_{j}-v_{k}\right\|<\infty.   (7)

Calculation of b_{k}. Define the following averages with the solution a^{k} of the least squares problem in Step 5 of Algorithm 3,

x_{k}^{a}=\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}x_{j},\quad v_{k}^{a}=\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}v_{j}.

Then (6) becomes

x_{k}^{AA}=(1-b_{k})x_{k}^{a}+b_{k}v_{k}^{a}=x_{k}^{a}+b_{k}(v_{k}^{a}-x_{k}^{a}).   (8)

The relaxation parameter b_{k} is generally determined heuristically. Many works choose b_{k}\equiv 1, thereby simplifying the expression to facilitate theoretical analysis. Little attention has been paid to the nonstationary case. As Anderson suggested in his comment Anderson19 , we design a dynamic factor

b_{k}=\min\left\{b,\ \frac{1}{k^{(1+\epsilon)}\|v_{k}^{a}-x_{k}^{a}\|}\right\},\quad b\in(0,1).   (9)

The adaptive idea is derived from inertial-based algorithms MJJYH23 ; CMY14 . Then for all k, we have b_{k}\|v_{k}^{a}-x_{k}^{a}\|\leq k^{-(1+\epsilon)}, which implies that

\sum_{k=1}^{\infty}b_{k}\|v_{k}^{a}-x_{k}^{a}\|<\infty.   (10)
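In code, the dynamic factor (9) together with the safeguard check of Step 5 can be realized as in the following MATLAB sketch, assuming X and V hold the windowed iterates x_{j} and trial points v_{j} as columns, ak is the solution of the least squares problem, vk is the current trial point, and eps0 stands for \epsilon (names are illustrative).

xa  = X*ak;  va = V*ak;                           % averages x_k^a and v_k^a
bk  = min(b, 1/(k^(1+eps0)*norm(va - xa)));       % dynamic relaxation factor (9)
xAA = xa + bk*(va - xa);                          % accelerated candidate (8)
if norm(xa - vk) <= c*k^(-(1+eps0))               % safeguard check of Step 5
    xk1 = xAA;                                    % accept the accelerated iterate
else
    xk1 = vk;                                     % fall back to the DFPM iterate
end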

4 Convergence analysis

We assume for simplicity that a solution of (1) is not found in finitely many steps. We first present the following lemma, which helps us complete the proof.

Lemma 2

Polyak1987 Let \{\alpha_{k}\} and \{\beta_{k}\} be two sequences of nonnegative real numbers satisfying \alpha_{k+1}\leq\alpha_{k}+\beta_{k} and \sum\limits_{k=1}^{\infty}\beta_{k}<+\infty. Then the sequence \{\alpha_{k}\} is convergent as k\to\infty.

We can now get the convergence results for AA-DFPM.

Lemma 3

Let the sequence \{x_{k}\} be generated by Algorithm 3. Then for any x^{*}\in\mathcal{S}, (i) the sequence \{\|x_{k}-x^{*}\|\} is convergent; (ii) \lim\limits_{k\to\infty}\alpha_{k}\|d_{k}\|=0.

Proof Depending on whether the iterate is produced by the AA step or not, we partition the proof into two cases.

Case I: The iterate does not pass the check (i.e., the original iteration).

(i) It follows from the monotonicity of FF and (4) that

F(z_{k})^{\top}(x_{k}-x^{*})=F(z_{k})^{\top}(x_{k}-z_{k})+F(z_{k})^{\top}(z_{k}-x^{*})\geq F(z_{k})^{\top}(x_{k}-z_{k})=-\alpha_{k}F(z_{k})^{\top}d_{k}\overset{(4)}{\geq}\sigma t_{1}\alpha_{k}^{2}\|d_{k}\|^{2}.   (11)

By the nonexpansiveness of P_{\mathcal{C}}(\cdot), for any given x^{*}\in\mathcal{S} we have

\|x_{k+1}-x^{*}\|^{2}=\|P_{\mathcal{C}}[x_{k}-\zeta u_{k}F(z_{k})]-P_{\mathcal{C}}[x^{*}]\|^{2}
\leq\|x_{k}-\zeta u_{k}F(z_{k})-x^{*}\|^{2}
=\|x_{k}-x^{*}\|^{2}-2\zeta u_{k}F(z_{k})^{\top}(x_{k}-x^{*})+\zeta^{2}u_{k}^{2}\|F(z_{k})\|^{2}
\overset{(11)}{\leq}\|x_{k}-x^{*}\|^{2}-2\zeta u_{k}F(z_{k})^{\top}(x_{k}-z_{k})+\zeta^{2}u_{k}^{2}\|F(z_{k})\|^{2}
=\|x_{k}-x^{*}\|^{2}-\zeta(2-\zeta)\frac{(F(z_{k})^{\top}(x_{k}-z_{k}))^{2}}{\|F(z_{k})\|^{2}}
\overset{(11)}{\leq}\|x_{k}-x^{*}\|^{2}-\zeta(2-\zeta)\frac{\sigma^{2}t_{1}^{2}\alpha_{k}^{4}\|d_{k}\|^{4}}{\|F(z_{k})\|^{2}}.   (12)

This shows that the sequence \{\|x_{k}-x^{*}\|\} is nonincreasing and convergent.

(ii) The above result implies that \{x_{k}\} is bounded. This, together with the continuity of F and (3), shows that \{d_{k}\} is bounded, which further implies that \{z_{k}\} and \{F(z_{k})\} are bounded. Suppose \|F(z_{k})\|\leq N. Summing (12) over k=0,1,2,\cdots, we have

\zeta(2-\zeta)\frac{\sigma^{2}t_{1}^{2}}{N^{2}}\sum\limits_{k=0}^{\infty}(\alpha_{k}\|d_{k}\|)^{4}\leq\sum\limits_{k=0}^{\infty}(\|x_{k}-x^{*}\|^{2}-\|x_{k+1}-x^{*}\|^{2})=\|x_{0}-x^{*}\|^{2}-\lim\limits_{k\to\infty}\|x_{k+1}-x^{*}\|^{2}<\infty,

where we have used 0<\zeta<2. This implies \lim\limits_{k\to\infty}\alpha_{k}\|d_{k}\|=0.

Case II: The iterate passes the check (i.e., the accelerated iteration).

(i) From (8), we have

\|x_{k+1}-x^{*}\|=\|x_{k}^{a}+b_{k}(v_{k}^{a}-x_{k}^{a})-x^{*}\|\leq\|x_{k}^{a}-x^{*}\|+b_{k}\|v_{k}^{a}-x_{k}^{a}\|,

and

\|x_{k}^{a}-x^{*}\|=\left\|\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}x_{j}-x^{*}\right\|=\left\|v_{k}-x^{*}+\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}x_{j}-v_{k}\right\|\leq\|v_{k}-x^{*}\|+\left\|\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}x_{j}-v_{k}\right\|.   (13)

Similar to the proof of (12), we get

\|v_{k}-x^{*}\|^{2}\leq\|x_{k}-x^{*}\|^{2}-\zeta(2-\zeta)\frac{\sigma^{2}t_{1}^{2}\alpha_{k}^{4}\|d_{k}\|^{4}}{\|F(z_{k})\|^{2}},   (14)

which means that \|v_{k}-x^{*}\|\leq\|x_{k}-x^{*}\|. Hence

\|x_{k+1}-x^{*}\|\leq\|x_{k}-x^{*}\|+\left\|\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}x_{j}-v_{k}\right\|+b_{k}\|v_{k}^{a}-x_{k}^{a}\|.

Using Lemma 2 with \alpha_{k}=\|x_{k}-x^{*}\| and \beta_{k}=\left\|\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}x_{j}-v_{k}\right\|+b_{k}\|v_{k}^{a}-x_{k}^{a}\|, together with (7) and (10), we conclude that the sequence \{\|x_{k}-x^{*}\|\} is convergent.

(ii) The above result implies the boundedness of \{x_{k}\}, and hence of \{F(z_{k})\}. Then there exists a constant M>0 such that \|x_{k}-x^{*}\|\leq M, and

\|x_{k}^{a}-x^{*}\|=\left\|\sum\limits_{j=k-m_{k}}^{k}a^{k}_{j}(x_{j}-x^{*})\right\|\leq\sum\limits_{j=k-m_{k}}^{k}|a^{k}_{j}|\|x_{j}-x^{*}\|\leq(m_{k}+1)M\leq(m+1)M.   (15)

By employing (13), the safeguard check (line 5 of Step 5 in Algorithm 3) and \|v_{k}-x^{*}\|\leq\|x_{k}-x^{*}\|\leq M, we have

\|x_{k}^{a}-x^{*}\|^{2}\overset{(13)}{\leq}(\|v_{k}-x^{*}\|+\|x_{k}^{a}-v_{k}\|)^{2}\leq(\|v_{k}-x^{*}\|+ck^{-(1+\epsilon)})^{2}\leq\|v_{k}-x^{*}\|^{2}+2cMk^{-(1+\epsilon)}+c^{2}k^{-2(1+\epsilon)}.   (16)

Therefore, we get

\|x_{k+1}-x^{*}\|^{2}=\|x_{k}^{a}-x^{*}+b_{k}(v_{k}^{a}-x_{k}^{a})\|^{2}
\leq\|x_{k}^{a}-x^{*}\|^{2}+(b_{k}\|v_{k}^{a}-x_{k}^{a}\|)^{2}+2b_{k}\|v_{k}^{a}-x_{k}^{a}\|\|x_{k}^{a}-x^{*}\|
\overset{(15)}{\leq}\|x_{k}^{a}-x^{*}\|^{2}+(b_{k}\|v_{k}^{a}-x_{k}^{a}\|)^{2}+2(m+1)Mb_{k}\|v_{k}^{a}-x_{k}^{a}\|
\overset{(16)}{\leq}\|v_{k}-x^{*}\|^{2}+2cMk^{-(1+\epsilon)}+c^{2}k^{-2(1+\epsilon)}+(b_{k}\|v_{k}^{a}-x_{k}^{a}\|)^{2}+2(m+1)Mb_{k}\|v_{k}^{a}-x_{k}^{a}\|
\overset{(14)}{\leq}\|x_{k}-x^{*}\|^{2}-\zeta(2-\zeta)\frac{\sigma^{2}t_{1}^{2}\alpha_{k}^{4}\|d_{k}\|^{4}}{\|F(z_{k})\|^{2}}+2cMk^{-(1+\epsilon)}+c^{2}k^{-2(1+\epsilon)}+(b_{k}\|v_{k}^{a}-x_{k}^{a}\|)^{2}+2(m+1)Mb_{k}\|v_{k}^{a}-x_{k}^{a}\|.

Summing the above inequality over k\geq 1, and using the boundedness of \{F(z_{k})\}, (10), and the convergence of the series \sum_{k\geq 1}k^{-(1+\epsilon)} and \sum_{k\geq 1}k^{-2(1+\epsilon)}, it follows that

\zeta(2-\zeta)\frac{\sigma^{2}t_{1}^{2}}{N^{2}}\sum\limits_{k=1}^{\infty}(\alpha_{k}\|d_{k}\|)^{4}\leq\|x_{1}-x^{*}\|^{2}-\lim\limits_{k\to\infty}\|x_{k+1}-x^{*}\|^{2}+\sum\limits_{k=1}^{\infty}\left(2cMk^{-(1+\epsilon)}+c^{2}k^{-2(1+\epsilon)}\right)+\sum\limits_{k=1}^{\infty}(b_{k}\|v_{k}^{a}-x_{k}^{a}\|)^{2}+2(m+1)M\sum\limits_{k=1}^{\infty}b_{k}\|v_{k}^{a}-x_{k}^{a}\|<\infty,

where N is a bound of \|F(z_{k})\|. This implies \lim\limits_{k\to\infty}\alpha_{k}\|d_{k}\|=0. ∎

Based on Lemma 3, we prove the global convergence theorem for AA-DFPM.

Theorem 4.1

The sequence \{x_{k}\} generated by Algorithm 3 converges to a solution of problem (1).

Proof Assume to the contrary that \liminf\limits_{k\to\infty}\|F_{k}\|>0. Then there exists a constant \varepsilon>0 such that

\|F_{k}\|\geq\varepsilon,\quad\forall\ k\geq 0.   (17)

Further, from (2) and the Cauchy-Schwarz inequality, we have

\|d_{k}\|\geq s_{1}\|F_{k}\|\geq s_{1}\varepsilon>0,\quad\forall\ k\geq 0.

This, together with Lemma 3 (ii), implies

\lim_{k\to\infty}\alpha_{k}=0.   (18)

In view of the boundedness of \{x_{k}\} and \{d_{k}\}, there exist convergent subsequences \{x_{k_{j}}\} and \{d_{k_{j}}\} such that

\lim\limits_{j\to\infty}x_{k_{j}}=\hat{x},\quad\lim\limits_{j\to\infty}d_{k_{j}}=\hat{d}.

Again, it follows from (2) that

-F_{k_{j}}^{\top}d_{k_{j}}\geq s_{1}\|F_{k_{j}}\|^{2},\quad\forall\ j.

Letting j\to\infty in the inequality above, by the continuity of F and (17), we get

-F(\hat{x})^{\top}\hat{d}\geq s_{1}\|F(\hat{x})\|^{2}\geq s_{1}\varepsilon^{2}>0.   (19)

Similarly, since \alpha_{k}\to 0, for all sufficiently large j we have \alpha_{k_{j}}<\gamma, and the minimality of i_{k_{j}} in the line search (4) yields

-F(x_{k_{j}}+\rho^{-1}\alpha_{k_{j}}d_{k_{j}})^{\top}d_{k_{j}}<\sigma\rho^{-1}\alpha_{k_{j}}P_{[t_{1},t_{2}]}[\|F(x_{k_{j}}+\rho^{-1}\alpha_{k_{j}}d_{k_{j}})\|]\|d_{k_{j}}\|^{2}.

Letting j\to\infty in the inequality above and taking into account (18) and the continuity of F, we conclude that -F(\hat{x})^{\top}\hat{d}\leq 0, which contradicts (19). Thus,

\liminf\limits_{k\to\infty}\|F_{k}\|=0.   (20)

By the boundedness of \{x_{k}\}, the continuity of F and (20), the sequence \{x_{k}\} has an accumulation point x^{*} such that F(x^{*})=0. Since x_{k}\in\mathcal{C} and \mathcal{C} is closed, we have x^{*}\in\mathcal{C}, and hence x^{*}\in\mathcal{S}. Combining this with the convergence of \{\|x_{k}-x^{*}\|\} (Lemma 3 (i)), we conclude that the whole sequence \{x_{k}\} converges to x^{*}\in\mathcal{S}. ∎

By Theorem 4.1, we may assume that x_{k}\to\tilde{x}\in\mathcal{S} as k\to\infty. Under the mild assumptions below, we further establish the linear convergence rate of AA-DFPM.

Assumption 4.1

The mapping F is Lipschitz continuous on \mathbb{R}^{n}, i.e., there exists a positive constant L such that

\|F(x)-F(y)\|\leq L\|x-y\|,\quad\forall\ x,y\in\mathbb{R}^{n}.   (21)

The Lipschitz continuity of F allows us to provide a uniform lower bound on the stepsize \alpha_{k}. Based on (2), (3) and (21), the proof is similar to that of Lemma 3.4 in WSLZ23 , so we omit it here.

Lemma 4

Suppose that Assumption 4.1 holds. Then the stepsize \alpha_{k} yielded by (4) satisfies

\alpha_{k}\geq\alpha:=\min\left\{\gamma,\ \frac{\rho s_{1}}{(L+\sigma t_{2})s_{2}^{2}}\right\}>0.   (22)
Assumption 4.2

For the limit \tilde{x}\in\mathcal{S} of \{x_{k}\}, suppose that there exist two positive constants \ell and \varepsilon such that

\ell\,{\rm dist}(x_{k},\mathcal{S})\leq\|F_{k}\|,\quad\forall\ x_{k}\in B(\tilde{x},\varepsilon),\ k=1,2,\dots   (23)

where B(\tilde{x},\varepsilon)=\{x\in\mathbb{R}^{n}:\|x-\tilde{x}\|<\varepsilon\}.

The local error bound condition in Assumption 4.2 is commonly used to establish convergence rates of DFPM for solving (1) (see LF19 ; WSLZ23 ; MJJYH23 ; OL18 for instance). We now estimate the asymptotic rate of convergence of the iteration for sufficiently large k. The sequence \{x_{k}\} in the proofs of the following theorems refers to iterates produced by the acceleration step; the convergence rate of the original iteration is identical to Theorem 4.5 in OL23 .

Theorem 4.2

Suppose that Assumptions 4.1 and 4.2 hold and the sequence \{x_{k}\} is generated by Algorithm 3. Then \{{\rm dist}(x_{k},\mathcal{S})\} satisfies

\frac{{\rm dist}(x_{k+1},\mathcal{S})}{{\rm dist}(x_{k},\mathcal{S})}\leq\sqrt{\varphi}+(c+1)\frac{k^{-(1+\epsilon)}}{{\rm dist}(x_{k},\mathcal{S})}.

Proof Let h_{k}\in\mathcal{S} be the solution closest to x_{k}, i.e., \|x_{k}-h_{k}\|={\rm dist}(x_{k},\mathcal{S}). Recall from (14) that

\|v_{k}-h_{k}\|^{2}\leq\|x_{k}-h_{k}\|^{2}-\zeta(2-\zeta)\frac{\sigma^{2}t_{1}^{2}\alpha_{k}^{4}\|d_{k}\|^{4}}{\|F(z_{k})\|^{2}}={\rm dist}^{2}(x_{k},\mathcal{S})-\zeta(2-\zeta)\frac{\sigma^{2}t_{1}^{2}\alpha_{k}^{4}\|d_{k}\|^{4}}{\|F(z_{k})\|^{2}}.   (24)

From (21), (3) and 0<\alpha_{k}\leq\gamma, it follows that

\|F(z_{k})\|=\|F(z_{k})-F(h_{k})\|\overset{(21)}{\leq}L\|z_{k}-h_{k}\|\leq L(\|x_{k}-z_{k}\|+\|x_{k}-h_{k}\|)=L(\alpha_{k}\|d_{k}\|+\|x_{k}-h_{k}\|)\leq L(\gamma\|d_{k}\|+\|x_{k}-h_{k}\|)\overset{(3)}{\leq}L(\gamma s_{2}\|F_{k}\|+\|x_{k}-h_{k}\|)=L(\gamma s_{2}\|F_{k}-F(h_{k})\|+\|x_{k}-h_{k}\|)\overset{(21)}{\leq}L(\gamma Ls_{2}+1)\|x_{k}-h_{k}\|=L(\gamma Ls_{2}+1){\rm dist}(x_{k},\mathcal{S})\leq\varrho\,{\rm dist}(x_{k},\mathcal{S}),   (25)

where \varrho:=\max\{L(\gamma Ls_{2}+1),\ \sqrt{\zeta(2-\zeta)}\sigma t_{1}\alpha^{2}s_{1}^{2}\ell^{2}\}. Again, from (2), (22) and (23), we have

\alpha_{k}^{4}\|d_{k}\|^{4}\geq\alpha^{4}s_{1}^{4}\|F_{k}\|^{4}\overset{(23)}{\geq}\alpha^{4}s_{1}^{4}\ell^{4}{\rm dist}^{4}(x_{k},\mathcal{S}).   (26)

Combining (24)-(26), we obtain

\|v_{k}-h_{k}\|^{2}\leq\varphi\,{\rm dist}^{2}(x_{k},\mathcal{S}),\quad\varphi=1-\zeta(2-\zeta)\left(\frac{\sigma t_{1}\alpha^{2}s_{1}^{2}\ell^{2}}{\varrho}\right)^{2}\in(0,1).

This, together with \|x_{k}^{a}-v_{k}\|\leq ck^{-(1+\epsilon)} and b_{k}\|v_{k}^{a}-x_{k}^{a}\|\leq k^{-(1+\epsilon)}, shows that

{\rm dist}(x_{k+1},\mathcal{S})\leq\|x_{k+1}-h_{k}\|=\|v_{k}-h_{k}+x_{k}^{a}-v_{k}+b_{k}(v_{k}^{a}-x_{k}^{a})\|\leq\|v_{k}-h_{k}\|+\|x_{k}^{a}-v_{k}\|+b_{k}\|v_{k}^{a}-x_{k}^{a}\|\leq\sqrt{\varphi}\,{\rm dist}(x_{k},\mathcal{S})+(c+1)k^{-(1+\epsilon)}.   (27)

Hence

\frac{{\rm dist}(x_{k+1},\mathcal{S})}{{\rm dist}(x_{k},\mathcal{S})}\leq\sqrt{\varphi}+(c+1)\frac{k^{-(1+\epsilon)}}{{\rm dist}(x_{k},\mathcal{S})}.

The proof is completed. ∎

Let a:=\limsup\limits_{k\to\infty}\frac{k^{-(1+\epsilon)}}{{\rm dist}(x_{k},\mathcal{S})}. Theorem 4.2 further implies the asymptotic convergence rate of the sequence \{{\rm dist}(x_{k},\mathcal{S})\}.

Corollary 1

Suppose that Assumptions 4.1 and 4.2 hold and the sequence \{x_{k}\} is generated by Algorithm 3. Then the following three claims hold.

(i) If 0<a<+\infty, then {\rm dist}(x_{k},\mathcal{S})=O(k^{-(1+\epsilon)});

(ii) If a=+\infty, then {\rm dist}(x_{k},\mathcal{S})=o(k^{-(1+\epsilon)});

(iii) If a=0, then the sequence \{{\rm dist}(x_{k},\mathcal{S})\} converges Q-linearly to 0, i.e.,

\limsup\limits_{k\to\infty}\frac{{\rm dist}(x_{k+1},\mathcal{S})}{{\rm dist}(x_{k},\mathcal{S})}<1.

This result is consistent with the convergence rate of the inertial-type DFPM WSLZ23 ; MJJYH23 . We further investigate the convergence rate of the sequence \{x_{k}\} when the mapping F is strongly monotone with modulus \mu>0, i.e.,

(F(x)-F(y))^{\top}(x-y)\geq\mu\|x-y\|^{2},\quad\forall\ x,y\in\mathbb{R}^{n}.
Theorem 4.3

Suppose that Assumptions 4.1 and 4.2 hold and the sequence \{x_{k}\} is generated by Algorithm 3. If the mapping F is strongly monotone, then \{\|x_{k}-x^{*}\|\} satisfies

\frac{\|x_{k+1}-x^{*}\|}{\|x_{k}-x^{*}\|}\leq\sqrt{\psi}+(c+1)\frac{k^{-(1+\epsilon)}}{\|x_{k}-x^{*}\|}.

Proof By the Cauchy-Schwarz inequality and the strong monotonicity of F, we have

\|F_{k}\|=\|F_{k}-F(x^{*})\|\geq\mu\|x_{k}-x^{*}\|.

Together with (2) and (22), this gives

\alpha_{k}^{4}\|d_{k}\|^{4}\geq\alpha^{4}s_{1}^{4}\|F_{k}\|^{4}\geq\alpha^{4}s_{1}^{4}\mu^{4}\|x_{k}-x^{*}\|^{4}.   (28)

Similar to the proof of (25), it follows that

\|F(z_{k})\|\leq L(\gamma Ls_{2}+1)\|x_{k}-x^{*}\|\leq\xi\|x_{k}-x^{*}\|,   (29)

where \xi:=\max\{L(\gamma Ls_{2}+1),\ \sqrt{\zeta(2-\zeta)}\sigma t_{1}\alpha^{2}s_{1}^{2}\mu^{2}\}. Combining (14) with (28) and (29) implies

\|v_{k}-x^{*}\|^{2}\leq\psi\|x_{k}-x^{*}\|^{2},\quad\psi=1-\zeta(2-\zeta)\left(\frac{\sigma t_{1}\alpha^{2}s_{1}^{2}\mu^{2}}{\xi}\right)^{2}\in(0,1).

Also similar to the proof of (27), we obtain

\|x_{k+1}-x^{*}\|=\|v_{k}-x^{*}+x_{k}^{a}-v_{k}+b_{k}(v_{k}^{a}-x_{k}^{a})\|\leq\|v_{k}-x^{*}\|+\|x_{k}^{a}-v_{k}\|+b_{k}\|v_{k}^{a}-x_{k}^{a}\|\leq\sqrt{\psi}\|x_{k}-x^{*}\|+(c+1)k^{-(1+\epsilon)}.

Thus

\frac{\|x_{k+1}-x^{*}\|}{\|x_{k}-x^{*}\|}\leq\sqrt{\psi}+(c+1)\frac{k^{-(1+\epsilon)}}{\|x_{k}-x^{*}\|}.

The proof is completed. ∎

Let A:=\limsup\limits_{k\to\infty}\frac{k^{-(1+\epsilon)}}{\|x_{k}-x^{*}\|}. We can also obtain the asymptotic convergence rate of the sequence \{x_{k}\} from Theorem 4.3.

Corollary 2

Suppose that Assumptions 4.1 and 4.2 hold and the sequence \{x_{k}\} is generated by Algorithm 3. Then the following statements hold.

(i) If 0<A<+\infty, then \|x_{k}-x^{*}\|=O(k^{-(1+\epsilon)});

(ii) If A=+\infty, then \|x_{k}-x^{*}\|=o(k^{-(1+\epsilon)});

(iii) If A=0, then the sequence \{x_{k}\} converges linearly to x^{*}\in\mathcal{S}, i.e.,

\limsup\limits_{k\to\infty}\frac{\|x_{k+1}-x^{*}\|}{\|x_{k}-x^{*}\|}<1.

5 Numerical experiments and applications

In this section, we demonstrate the effectiveness of AA-DFPM through experiments on constrained nonlinear equations as well as a real-world problem of machine learning. All tests are conducted in MATLAB R2016b on a 64-bit Lenovo laptop with Intel(R) Core(TM) i7-6700HQ CPU (2.60 GHz), 16.00 GB RAM and Windows 10 OS. Throughout the numerical experiments, three search directions are chosen as follows:

1) Spectral conjugate gradient projection (SCGP) method WSLZ23

d_{k}=\left\{\begin{array}{ll}-F_{k},&k=0,\\ -\theta_{k}F_{k}+\beta_{k}d_{k-1},&k\geq 1\ {\rm and}\ \theta_{k}\in[\vartheta_{1},\vartheta_{2}],\\ -F_{k}+\zeta\frac{\|F_{k}\|}{\|d_{k-1}\|}d_{k-1},&k\geq 1\ {\rm and}\ \theta_{k}\notin[\vartheta_{1},\vartheta_{2}],\end{array}\right.   (30)

where

\begin{gathered}\beta_{k}=\max\left\{\frac{F_{k}^{\top}\eta_{k-1}}{d_{k-1}^{\top}v_{k-1}}-\frac{\|\eta_{k-1}\|^{2}F_{k}^{\top}d_{k-1}}{(d_{k-1}^{\top}v_{k-1})^{2}},\ \chi\frac{F_{k}^{\top}d_{k-1}}{\|d_{k-1}\|^{2}}\right\},\\ \eta_{k-1}=y_{k-1}+\tau_{k}F_{k},\quad\tau_{k}=\tau\frac{\|y_{k-1}\|}{\|F_{k}\|}+\min\left\{0,\frac{-F_{k}^{\top}y_{k-1}}{\|F_{k}\|^{2}}\right\},\\ v_{k-1}=y_{k-1}+\lambda_{k}d_{k-1},\quad\lambda_{k}=\frac{\|y_{k-1}\|}{\|d_{k-1}\|}+\max\left\{0,\frac{-d_{k-1}^{\top}y_{k-1}}{\|d_{k-1}\|^{2}}\right\},\\ \theta_{k}=\frac{s_{k-1}^{\top}F_{k}+\beta_{k}y_{k-1}^{\top}d_{k-1}}{F_{k}^{\top}y_{k-1}},\quad s_{k-1}=x_{k}-x_{k-1},\quad y_{k-1}=F_{k}-F_{k-1},\end{gathered}

and \chi\in(0,\frac{1}{4}), \zeta\in[0,1), \tau>0, \frac{1}{4}<\vartheta_{1}<\vartheta_{2}. From Lemma 3.1 in WSLZ23 , direction (30) satisfies conditions (2) and (3).

2) Hybrid three-term conjugate gradient projection (HTTCGP) method YJJLW21

d_{k}=\left\{\begin{array}{ll}-F_{k},&k=0,\\ -F_{k}+\beta_{k}d_{k-1}+\tilde{\upsilon}_{k}y_{k-1},&k\geq 1,\end{array}\right.   (31)

where

\begin{gathered}\beta_{k}=\frac{F_{k}^{\top}y_{k-1}}{\tau_{k}}-\frac{\|y_{k-1}\|^{2}F_{k}^{\top}d_{k-1}}{\tau_{k}^{2}},\quad\tilde{\upsilon}_{k}=\delta_{k}\frac{F_{k}^{\top}d_{k-1}}{\tau_{k}},\\ \tau_{k}=\max\{\mu\|d_{k-1}\|\|y_{k-1}\|,\ d_{k-1}^{\top}y_{k-1},\ \|F_{k-1}\|^{2}\},\end{gathered}

with parameters \mu>0 and 0\leq\delta_{k}\leq\delta<1. From Lemma 2 in YJJLW21 , direction (31) satisfies conditions (2) and (3).
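As an illustration, the HTTCGP direction (31) for k\geq 1 can be computed with a few lines of MATLAB; Fk, Fkm1 and dkm1 denote F_{k}, F_{k-1} and d_{k-1}, while mu and deltak are the parameters above (variable names are illustrative).

y     = Fk - Fkm1;                                            % y_{k-1}
tauk  = max([mu*norm(dkm1)*norm(y), dkm1'*y, norm(Fkm1)^2]);  % tau_k
betak = (Fk'*y)/tauk - norm(y)^2*(Fk'*dkm1)/tauk^2;           % beta_k
upsk  = deltak*(Fk'*dkm1)/tauk;                               % tilde upsilon_k
dk    = -Fk + betak*dkm1 + upsk*y;                            % three-term direction (31)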

3) Modified spectral three-term conjugate gradient method AF23 (since this direction was originally designed for a conjugate gradient method for unconstrained problems, we adapt it slightly here to accommodate DFPM and name the resulting method MSTTCGP).

d_{k}=\left\{\begin{array}{ll}-F_{k},&k=0,\\ -\theta_{k}F_{k}+\beta_{k}d_{k-1}-\tilde{\upsilon}_{k}y_{k-1},&k\geq 1\ {\rm and}\ \theta_{k}\in[\vartheta_{1},\vartheta_{2}],\\ -F_{k}+\beta_{k}d_{k-1}-\tilde{\upsilon}_{k}y_{k-1},&k\geq 1\ {\rm and}\ \theta_{k}\notin[\vartheta_{1},\vartheta_{2}],\end{array}\right.   (32)

where

\begin{gathered}\beta_{k}=\frac{F_{k}^{\top}y_{k-1}}{\tau_{k}},\quad\tilde{\upsilon}_{k}=\frac{F_{k}^{\top}d_{k-1}}{\tau_{k}},\quad\theta_{k}=\frac{s_{k-1}^{\top}F_{k}+\beta_{k}y_{k-1}^{\top}d_{k-1}-\tilde{\upsilon}_{k}\|y_{k-1}\|^{2}}{F_{k}^{\top}y_{k-1}},\\ \tau_{k}=\max\{\mu\|d_{k-1}\|\|y_{k-1}\|,\ d_{k-1}^{\top}y_{k-1},\ \|F_{k-1}\|^{2}\},\end{gathered}

in which \vartheta_{2}>\vartheta_{1}>0 and \mu>0. From Corollary 3.1 in AF23 , it follows that (32) satisfies condition (2). We now show that (32) also satisfies condition (3). By the definitions of the parameters \theta_{k}, \beta_{k} and \tilde{\upsilon}_{k}, we get

\|d_{k}\|=\|-\theta_{k}F_{k}+\beta_{k}d_{k-1}-\tilde{\upsilon}_{k}y_{k-1}\|\leq\theta_{k}\|F_{k}\|+|\beta_{k}|\|d_{k-1}\|+|\tilde{\upsilon}_{k}|\|y_{k-1}\|=\theta_{k}\|F_{k}\|+\frac{|F_{k}^{\top}y_{k-1}|\|d_{k-1}\|}{\tau_{k}}+\frac{|F_{k}^{\top}d_{k-1}|\|y_{k-1}\|}{\tau_{k}}\leq\theta_{k}\|F_{k}\|+\frac{\|F_{k}\|\|y_{k-1}\|\|d_{k-1}\|}{\mu\|d_{k-1}\|\|y_{k-1}\|}+\frac{\|F_{k}\|\|d_{k-1}\|\|y_{k-1}\|}{\mu\|d_{k-1}\|\|y_{k-1}\|}\leq\left(\vartheta_{2}+\frac{2}{\mu}\right)\|F_{k}\|

for k\geq 1 and \theta_{k}\in[\vartheta_{1},\vartheta_{2}], and

\|d_{k}\|=\|-F_{k}+\beta_{k}d_{k-1}-\tilde{\upsilon}_{k}y_{k-1}\|\leq\left(1+\frac{2}{\mu}\right)\|F_{k}\|

for k\geq 1 and \theta_{k}\notin[\vartheta_{1},\vartheta_{2}]. Thus \|d_{k}\|\leq s_{2}\|F_{k}\| with s_{2}:=\max\{1,\vartheta_{2}\}+\frac{2}{\mu}.

All parameters of SCGP, HTTCGP and MSTTCGP are the same as in their original papers. In addition, for MSTTCGP we set the line search and projection parameters \sigma=0.01, \gamma=1, \rho=0.6, \xi=1.7, t_{1}=0.001, t_{2}=0.4. We use AA-SCGP, AA-HTTCGP and AA-MSTTCGP to denote their Anderson acceleration variants with the AA parameters m=3, c=10, b=0.1 and \lambda=10^{-10}. In the implementation, all algorithms stop when \|F_{k}\|\leq\epsilon=10^{-6} or when the number of iterations exceeds 2,000.

5.1 Large-scale nonlinear equations

We first test these algorithms on standard constrained nonlinear equations of different dimensions. To assess the effectiveness of the algorithms objectively, we run each problem with initial points randomly generated from the interval (0,1). The numerical results, obtained by running each test 10 times with each algorithm, are presented in Table 1, where "P(n)/\overline{\rm Iter}/\overline{\rm NF}/\overline{\rm Tcpu}/\overline{\|F^{\ast}\|}" stand for the test problem (problem dimension), the average number of iterations, the average number of evaluations of F, the average CPU time in seconds, and the average final value of \|F_{k}\| when the program stops, respectively. Test Problems 1-4 below are the same as Problems 1, 3, 5 and 7 in WSLZ23 , respectively. The convex constraint of these problems is \mathcal{C}=\mathbb{R}_{+}^{n} and the mapping F is defined as

F(x)=\left(f_{1}(x),f_{2}(x),\cdots,f_{n}(x)\right)^{\top}.

Problem 1.

f_{i}(x)=e^{x_{i}}-1,\quad i=1,2,\cdots,n.

Problem 2.

f_{i}(x)=\ln(x_{i}+1)-\frac{x_{i}}{n},\quad i=1,2,\cdots,n.

Problem 3.

f_{1}(x)=e^{x_{1}}-1,\quad f_{i}(x)=e^{x_{i}}+x_{i}-1,\quad i=2,3,\cdots,n.

Problem 4.

f_{i}(x)=2x_{i}-\sin(x_{i}),\quad i=1,2,\cdots,n.
Figure 1: Performance profiles of these methods for constrained nonlinear equations: (a) performance profiles on \overline{\rm Iter}; (b) performance profiles on \overline{\rm NF}.
Table 1: Numerical results on Problems 1-4 with random initial points
P(n)  MSTTCGP  AA-MSTTCGP  HTTCGP  AA-HTTCGP  SCGP  AA-SCGP
(each method column reports \overline{\rm Iter}/\overline{\rm NF}/\overline{\rm Tcpu}/\overline{\|F^{\ast}\|})
1(10000) 38.4/113.8/0.051/5.27e-07 5.0/23.0/0.048/0.00e+00 11.0/29.1/0.015/3.17e-07 4.0/14.0/0.029/0.00e+00 14.0/29.0/0.020/6.56e-07 9.0/29.0/0.053/0.00e+00
1(30000) 44.8/135.0/0.135/5.78e-07 9.1/32.9/0.077/0.00e+00 8.8/22.7/0.024/3.37e-07 7.0/22.0/0.056/0.00e+00 15.0/31.0/0.038/3.34e-07 9.0/29.0/0.076/0.00e+00
1(50000) 43.6/131.2/0.216/6.38e-07 9.7/30.7/0.104/0.00e+00 11.8/31.0/0.051/4.13e-07 7.0/22.0/0.078/0.00e+00 15.0/31.0/0.068/4.31e-07 3.0/9.0/0.033/0.00e+00
1(80000) 42.7/128.4/0.298/4.62e-07 12.2/39.3/0.195/0.00e+00 10.9/28.7/0.076/3.93e-07 7.0/21.0/0.114/1.99e-09 15.0/31.0/0.100/5.45e-07 3.0/9.0/0.040/0.00e+00
1(100000) 40.6/125.4/0.358/5.48e-07 13.1/39.9/0.250/0.00e+00 11.8/31.3/0.094/2.44e-07 7.0/21.0/0.139/8.52e-10 15.0/31.0/0.123/6.08e-07 5.0/13.0/0.092/0.00e+00
1(120000) 45.4/138.5/0.501/4.22e-07 13.1/39.4/0.275/0.00e+00 11.0/28.3/0.110/2.60e-07 7.0/21.0/0.159/3.46e-10 15.0/31.0/0.144/6.67e-07 5.0/13.0/0.111/0.00e+00
1(150000) 47.3/142.4/0.750/6.23e-07 13.0/39.0/0.352/0.00e+00 11.6/30.7/0.178/3.79e-07 7.0/21.0/0.198/4.15e-14 15.0/31.0/0.247/7.47e-07 5.7/14.4/0.165/0.00e+00
1(180000) 44.2/130.6/0.802/5.02e-07 12.4/37.2/0.398/0.00e+00 12.7/33.7/0.230/3.19e-07 7.0/21.0/0.243/2.76e-11 15.0/31.0/0.280/8.17e-07 12.0/27.0/0.452/0.00e+00
1(200000) 46.0/137.9/0.924/4.91e-07 12.0/36.0/0.429/0.00e+00 11.2/29.2/0.223/3.85e-07 7.0/21.0/0.272/6.60e-15 15.0/31.0/0.327/8.61e-07 12.0/27.0/0.481/0.00e+00
1(250000) 44.1/132.7/1.124/5.09e-07 12.0/36.0/0.514/0.00e+00 11.0/28.5/0.267/2.31e-07 7.0/21.0/0.324/2.58e-14 15.0/31.0/0.396/9.63e-07 12.0/27.0/0.581/0.00e+00
2(10000) 25.3/81.8/0.024/3.10e-07 8.8/30.3/0.051/0.00e+00 10.9/29.7/0.009/5.77e-07 7.2/19.4/0.040/0.00e+00 14.0/29.0/0.012/3.22e-07 8.0/19.0/0.045/0.00e+00
2(30000) 28.4/91.6/0.049/6.15e-07 9.2/31.4/0.064/0.00e+00 10.0/27.0/0.016/0.00e+00 7.3/18.6/0.047/2.98e-07 14.0/29.0/0.022/5.61e-07 7.0/17.0/0.050/0.00e+00
2(50000) 24.4/82.0/0.067/3.42e-07 14.2/49.1/0.136/0.00e+00 10.0/27.0/0.024/0.00e+00 3.0/10.0/0.022/0.00e+00 14.0/29.0/0.032/7.23e-07 7.0/17.0/0.060/0.00e+00
2(80000) 21.1/71.0/0.091/3.96e-07 11.6/46.5/0.157/0.00e+00 10.0/27.0/0.037/0.00e+00 3.0/10.0/0.028/0.00e+00 14.0/29.0/0.050/9.14e-07 7.0/17.0/0.085/0.00e+00
2(100000) 29.3/99.5/0.162/2.19e-07 13.1/43.9/0.216/0.00e+00 10.0/27.0/0.049/0.00e+00 3.0/10.0/0.038/0.00e+000 15.0/31.0/0.069/3.09e-07 7.0/17.0/0.108/0.00e+00
2(120000) 23.5/80.1/0.159/1.49e-07 9.6/37.4/0.183/0.00e+00 10.0/27.0/0.057/0.00e+00 3.0/10.0/0.043/0.00e+00 15.0/31.0/0.083/3.38e-07 7.0/17.0/0.119/0.00e+00
2(150000) 27.2/87.5/0.339/2.94e-07 12.7/46.4/0.366/0.00e+00 10.0/27.0/0.118/0.00e+00 5.0/16.0/0.125/0.00e+00 15.0/31.0/0.184/3.79e-07 7.0/17.0/0.192/0.00e+00
2(180000) 32.0/103.2/0.475/4.15e-07 11.5/40.9/0.380/0.00e+00 10.0/27.0/0.140/0.00e+00 5.0/16.0/0.147/0.00e+00 15.0/31.0/0.220/4.15e-07 7.0/17.0/0.222/0.00e+00
2(200000) 32.2/105.2/0.530/4.73e-07 11.8/40.4/0.428/0.00e+00 10.0/27.0/0.153/0.00e+00 5.0/16.0/0.159/0.00e+00 15.0/31.0/0.241/4.37e-07 7.0/17.0/0.244/0.00e+00
2(250000) 32.7/108.2/0.676/4.19e-07 14.2/48.3/0.631/2.44e-08 10.0/27.0/0.190/0.00e+00 5.0/16.0/0.194/0.00e+00 15.0/31.0/0.293/4.89e-07 7.0/17.0/0.293/0.00e+00
3(10000) 17.4/69.0/0.030/6.67e-08 6.0/31.0/0.040/0.00e+00 12.0/40.7/0.018/1.78e-07 6.0/25.0/0.038/3.70e-14 20.0/61.0/0.030/7.03e-07 5.0/19.0/0.030/0.00e+00
3(30000) 17.6/69.5/0.066/7.75e-08 6.0/30.0/0.057/0.00e+00 12.0/40.4/0.041/1.56e-07 5.0/21.0/0.044/9.24e-10 21.0/64.0/0.075/4.90e-07 8.0/28.0/0.075/0.00e+00
3(50000) 17.8/70.8/0.109/2.09e-15 6.0/30.0/0.078/0.00e+00 12.0/40.5/0.058/1.63e-07 10.8/43.1/0.145/2.04e-08 21.0/64.0/0.111/6.30e-07 6.0/30.0/0.087/0.00e+00
3(80000) 18.8/74.4/0.177/1.63e-07 6.0/30.0/0.106/0.00e+00 12.0/40.2/0.087/2.42e-07 10.2/41.8/0.192/3.18e-08 21.0/64.0/0.167/7.99e-07 6.0/30.0/0.122/0.00e+00
3(100000) 12.2/49.5/0.141/8.46e-08 6.0/30.0/0.126/0.00e+00 12.0/40.4/0.112/2.69e-07 8.0/34.2/0.184/0.00e+00 21.0/64.0/0.200/8.91e-07 6.0/30.0/0.152/0.00e+00
3(120000) 23.9/97.1/0.333/1.73e-07 6.0/30.0/0.153/0.00e+00 12.0/40.0/0.137/3.26e-07 7.5/32.2/0.193/0.00e+00 21.0/64.0/0.255/9.76e-07 6.0/30.0/0.175/0.00e+00
3(150000) 16.1/64.8/0.344/8.86e-08 6.0/30.0/0.211/0.00e+00 12.0/40.3/0.210/3.02e-07 7.4/31.7/0.261/6.99e-09 22.0/67.6/0.421/5.59e-07 6.0/30.0/0.242/0.00e+00
3(180000) 15.0/61.2/0.359/4.86e-08 6.0/30.0/0.245/0.00e+00 12.0/40.2/0.254/3.68e-07 7.5/31.7/0.303/0.00e+00 22.0/67.7/0.491/5.19e-07 7.0/33.0/0.334/0.00e+00
3(200000) 17.4/70.2/0.464/9.00e-08 6.0/30.0/0.268/0.00e+00 12.0/40.0/0.288/3.76e-07 7.8/32.8/0.325/1.72e-11 22.0/68.2/0.567/5.97e-07 7.0/33.0/0.350/0.00e+00
3(250000) 16.1/63.3/0.538/1.60e-08 6.0/30.0/0.355/0.00e+00 12.0/40.1/0.349/4.54e-07 9.8/40.6/0.531/0.00e+00 22.0/67.7/0.694/6.13e-07 4.0/16.0/0.214/0.00e+00
4(10000) 8.5/27.5/0.006/4.99e-07 2.0/11.4/0.006/0.00e+00 7.0/18.0/0.004/2.62e-07 2.4/9.1/0.008/0.00e+00 2.0/5.0/0.003/0.00e+00 1.0/5.0/0.002/0.00e+00
4(30000) 10.2/31.7/0.010/4.81e-07 2.0/12.6/0.008/0.00e+00 7.0/18.0/0.007/4.50e-07 6.0/18.0/0.035/3.36e-09 2.0/5.0/0.005/0.00e+00 1.0/5.0/0.002/0.00e+00
4(50000) 11.3/35.1/0.016/4.08e-07 2.0/12.2/0.010/0.00e+00 7.0/18.0/0.009/5.61e-07 6.0/18.0/0.042/2.95e-10 2.0/5.0/0.007/0.00e+00 1.0/5.0/0.003/0.00e+00
4(80000) 12.1/37.7/0.027/4.03e-07 2.0/11.4/0.015/0.00e+00 7.0/18.0/0.015/7.25e-07 6.0/18.0/0.054/2.96e-10 2.0/5.0/0.011/0.00e+00 1.0/5.0/0.005/0.00e+00
4(100000) 12.6/39.1/0.035/5.64e-07 2.0/12.2/0.019/0.00e+00 7.0/18.0/0.019/8.38e-07 6.0/18.0/0.068/2.87e-10 2.0/5.0/0.014/0.00e+00 1.0/5.0/0.007/0.00e+00
4(120000) 12.8/40.4/0.045/5.24e-07 2.0/12.6/0.024/0.00e+00 7.0/18.0/0.023/9.00e-07 6.0/18.0/0.079/2.81e-10 2.0/5.0/0.017/0.00e+00 1.0/5.0/0.009/0.00e+00
4(150000) 11.8/37.3/0.089/4.04e-07 2.0/11.8/0.037/0.00e+00 7.6/19.8/0.055/5.14e-07 6.0/18.0/0.116/5.67e-10 2.0/5.0/0.030/0.00e+00 1.0/5.0/0.017/0.00e+00
4(180000) 12.5/38.3/0.110/3.37e-07 2.0/12.2/0.046/0.00e+00 8.0/21.0/0.070/2.25e-07 6.0/18.0/0.140/7.78e-10 2.0/5.0/0.036/0.00e+00 1.0/5.0/0.020/0.00e+00
4(200000) 13.0/40.9/0.128/2.93e-07 2.0/12.6/0.049/0.00e+00 8.0/21.0/0.074/2.34e-07 6.0/18.0/0.146/9.32e-10 2.0/5.0/0.040/0.00e+00 1.0/5.0/0.023/0.00e+00
4(250000) 13.7/44.2/0.173/3.57e-07 2.0/12.2/0.060/0.00e+00 8.0/21.0/0.093/2.64e-07 6.0/18.0/0.178/1.03e-09 2.0/5.0/0.050/0.00e+00 1.0/5.0/0.027/0.00e+00

Moreover, we use performance profiles DM02 to visually compare the methods, as illustrated in Figures 1(a) and 1(b), which describe \overline{\rm Iter} and \overline{\rm NF}, respectively. Recall that in a performance profile, efficiency and robustness can be read off at the left and right extremes of the graphic, respectively; in short, the higher the curve, the better the method. It is clear from Figure 1 that the acceleration is effective in its purpose of speeding up DFPM. Table 1 shows that the AA variants of the three DFPM are all superior (in terms of \overline{\rm Iter}, \overline{\rm NF} and \overline{\|F^{\ast}\|}) to their original counterparts on the chosen set of test problems, which also confirms the encouraging capability of AA for DFPM. In contrast, \overline{\rm Tcpu} deteriorates in certain tests, since AA has to solve an extra optimization problem in each iteration. These results motivate us to further apply AA-DFPM to real-world problems.

5.2 Regularized decentralized logistic regression

We consider a real-world application, regularized decentralized logistic regression:

\min_{x\in\mathbb{R}^{n}}f(x):=\frac{1}{M}\sum_{i=1}^{M}\ln(1+\exp(-b_{i}a_{i}^{\top}x))+\frac{\tau}{2}\|x\|^{2},   (33)

where \tau>0 is a regularization parameter, \frac{1}{M}\sum_{i=1}^{M}\ln(1+\exp(-b_{i}a_{i}^{\top}x)) is the logistic loss function, and the data pairs (a_{i},b_{i})\in\mathbb{R}^{n}\times\{-1,1\},\ i=1,\ldots,M, are taken from a given data set or distribution. It is easy to verify that the objective function f is strongly convex Luo21 . Hence x^{*}\in\mathbb{R}^{n} is the unique optimal solution of (33) if and only if it is a root of the following monotone nonlinear equations JYTH22 ,

F(x):=\nabla f(x)=\frac{1}{M}\sum_{i=1}^{M}\frac{-b_{i}\exp(-b_{i}a_{i}^{\top}x)a_{i}}{1+\exp(-b_{i}a_{i}^{\top}x)}+\tau x=0. (34)
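For concreteness, the residual map in (34) can be evaluated with a few lines of MATLAB. The following sketch is our own illustration (not the authors' implementation); the names A, b and tau are ours, with the samples $a_i$ stacked as the rows of A and the labels collected in b:

% Minimal sketch (our own): residual F(x) = grad f(x) of (34).
% A: M-by-n data matrix with rows a_i'; b: M-by-1 labels in {-1,1}; tau: regularization parameter.
function Fx = logreg_residual(x, A, b, tau)
    z  = -b .* (A * x);           % z_i = -b_i * a_i' * x
    s  = 1 ./ (1 + exp(-z));      % s_i = exp(z_i) / (1 + exp(z_i))
    Fx = A' * (-b .* s) / size(A, 1) + tau * x;
end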

Thus we can apply AA-DFPM to solve the above problem. Since problem (34) is unconstrained, we can dispense with the nonnegativity constraints in the least squares problem of Algorithm 3 (line 3 of Step 5). Letting $a^{k}_{k}=1-\sum_{j=k-m_{k}}^{k-1}a^{k}_{j}$, the least squares problem can be reformulated as

\min\limits_{(a^{k}_{k-{m_{k}}},\dots,a^{k}_{k-1})^{\top}}\left\|r_{k}+\sum\limits_{j=k-{m_{k}}}^{k-1}a^{k}_{j}(r_{j}-r_{k})\right\|^{2}. (35)

Our implementation solves the above problem via QR decomposition. The QR factorization associated with problem (35) at iteration $k$ can be obtained efficiently from that at iteration $k-1$ in $O(m_{k}n)$ operations GVL96.
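As an illustration only (our own sketch, not the authors' implementation), problem (35) can be solved in MATLAB by a dense least squares call; here R collects the most recent residuals $r_{k-m_k},\dots,r_k$ as columns, the backslash operator plays the role of the QR solve, and the incremental update of the factorization mentioned above is omitted for brevity:

% Minimal sketch (our own): solve (35) and recover the full AA coefficient vector.
% R: n-by-(m_k+1) matrix with columns r_{k-m_k}, ..., r_{k-1}, r_k.
function a = aa_coefficients(R)
    rk = R(:, end);              % current residual r_k
    D  = R(:, 1:end-1) - rk;     % columns r_j - r_k, j = k-m_k, ..., k-1
    g  = -(D \ rk);              % least squares: min || r_k + D*g ||
    a  = [g; 1 - sum(g)];        % append a_k^k = 1 - sum of the remaining coefficients
end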

We focus exclusively on AA-SCGP, the top-performing method in our initial experiments, and compare it with two DFPM variants incorporating inertial acceleration: MITTCGP MJJYH23 and IHCGPM3 JYTH22. The parameters of MITTCGP and IHCGPM3 are set to their default values, while the parameters of AA-SCGP are taken from the experiment in the previous subsection. The regularization parameter is fixed at $\tau=0.01$ and the test instances are taken from the LIBSVM data sets CL11 (available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Statistics of the test instances used are listed in Table 2. The termination criterion of all three algorithms is the same as in the first experiment.

Table 2: Numerical results on the LIBSVM test instances
Data sets Data points Variables IHCGPM3 MITTCGP AA-SCGP
$M$ $n$ $\overline{\rm Iter}$/$\overline{\rm NF}$/$\overline{\rm Tcpu}$/$\overline{\|F^{\ast}\|}$ $\overline{\rm Iter}$/$\overline{\rm NF}$/$\overline{\rm Tcpu}$/$\overline{\|F^{\ast}\|}$ $\overline{\rm Iter}$($\overline{\rm NAA}$)/$\overline{\rm NF}$/$\overline{\rm Tcpu}$/$\overline{\|F^{\ast}\|}$
fourclass_scale 862 2 107.6/323.8/0.017/9.60e-07 246.6/740.4/0.038/9.82e-07 11.4(10.0)/23.8/0.011/2.18e-07
liver-disorders 145 5 492.2/1520.6/0.023/9.93e-07 1362.6/4136.8/0.065/9.98e-07 760.0(462.4)/2086.4/0.088/7.70e-07
phishing 11055 68 561.8/1685.6/14.016/9.93e-07 1394.0/4182.6/34.768/9.97e-07 565.0(243.4)/1131.0/13.651/5.14e-07
w4a 7366 300 584.6/1754.8/2.815/9.92e-07 1461.4/4384.4/7.050/9.97e-07 628.0(309.0)/1257.2/3.104/9.11e-07
w5a 9888 300 584.0/1752.6/3.714/9.95e-07 1459.8/4380.0/9.226/9.98e-07 520.0(87.0)/1041.0/3.039/9.53e-07
w6a 17188 300 584.4/1753.0/8.294/9.95e-07 1460.2/4381.6/20.713/9.96e-07 757.2(575.8)/1515.4/12.007/7.83e-07

Figure 2 displays the results on four test instances solved by the three methods with the fixed initial point $(0,\dots,0)^{\top}$. One can observe that our method outperforms the others. Specifically, across the different categories of data sets, AA-SCGP performs better on instances with fewer data points $M$ and variables $n$; a possible reason is that the data set “fourclass_scale” has been scaled to $[-1,1]$. For the similar data sets “w5a” and “w6a”, AA-SCGP exhibits faster convergence for larger $M$. As before, we use the MATLAB script “2*(rand(n,1)-0.5)” to generate random initial points and run each test instance 5 times. Table 2 reports the average computational results, where the additional column $\overline{\rm NAA}$ denotes the average number of AA steps. The results indicate that the solutions obtained by AA-SCGP are consistently of higher quality than those generated by IHCGPM3 and MITTCGP. For most test instances, AA-SCGP also outperforms the two inertial algorithms in terms of $\overline{\rm NF}$ and $\overline{\rm Tcpu}$. This benefits from the fact that AA-SCGP accelerates frequently during its iterations, with AA steps accounting for 55.6% of the iterations on average. These facts further illustrate that DFPM integrated with Anderson acceleration is valid and promising.

Figure 2: Change of $\|F(x)\|$ for problem (34) with initial point $(0,\dots,0)^{\top}$ on the data sets (a) fourclass_scale, (b) phishing, (c) w5a, and (d) w6a.

6 Conclusions

In this paper, we developed a novel algorithm that applies Anderson acceleration (AA) to the derivative-free projection method (DFPM) for solving convex-constrained monotone nonlinear equations. We first reviewed the convergence of a general framework for DFPM and then explored how AA can still be exploited within DFPM even though the latter may not be a fixed-point iteration. As a result, an acceleration algorithm (AA-DFPM) with slight modifications was proposed, and its global convergence was obtained under no additional assumptions. The convergence rate was further established under suitable conditions. The results of both the preliminary numerical experiments and the real-world application demonstrate the superior performance of the proposed method. As future research, we plan to investigate a novel DFPM for general nonlinear equations without monotonicity. We also intend to explore a simplified AA whose coefficients are easy to calculate, even for a large window size.

Acknowledgements.
This work was supported by the National Key Research and Development Program (2020YFA0713504) and the National Natural Science Foundation of China (61977065).

References

  • (1) Meintjes, K., Morgan, A.P.: A methodology for solving chemical equilibrium systems. Appl. Math. Comput. 2(4), 333-361 (1987)
  • (2) Qu, B., Xiu, N.H.: A note on the CQ algorithm for the split feasibility problem. Inverse Probl. 21(5), 1655-1665 (2005)
  • (3) Chorowski, J., Zurada, J.M.: Learning understandable neural networks with nonnegative weight constraints. IEEE Trans. Neural Netw. Learn. Syst. 26(1), 62-69 (2014)
  • (4) Xiao, Y.H., Wang, Q.Y., Hu, Q.J.: Non-smooth equations based method for 1\ell_{1}-norm problems with applications to compressed sensing. Nonlinear. Anal. 74(11), 3570-3577 (2011)
  • (5) Jian, J.B., Yin, J.H., Tang, C.M., Han, D.L.: A family of inertial derivative-free projection methods for constrained nonlinear pseudo-monotone equations with applications. Comp. Appl. Math. 41, 309 (2022)
  • (6) Sun, D.F., Womersley, R.S., Qi, H.D.: A feasible semismooth asymptotically Newton method for mixed complementarity problems. Math. Program. 94(1), 167-187 (2002)
  • (7) Qi, L.Q., Tong, X.J., Li, D.H.: Active-set projected trust-region algorithm for box-constrained nonsmooth equations. J. Optim. Theory Appl. 120(3), 601-625 (2004)
  • (8) Yin, J.H., Jian, J.B., Ma, G.D.: A modified inexact Levenberg-Marquardt method with the descent property for solving nonlinear equations. Comput. Optim. Appl. 87, 289-322 (2023)
  • (9) Solodov, M.V., Svaiter, B.F.: A globally convergent inexact Newton method for systems of monotone equations. In: Fukushima, M., Qi, L. (eds.): Reformulation: Nonsmooth, Piecewise Smooth, Semismooth and Smoothing Methods, pp. 355-369. Kluwer, Dordrecht (1998)
  • (10) Liu, J.K., Feng, Y.M.: A derivative-free iterative method for nonlinear monotone equations with convex constraints. Numer. Algorithms 82, 245-262 (2019)
  • (11) Liu, J.K., Lu, Z.L., Xu, J.L., Wu, S., Tu, Z.W.: An efficient projection-based algorithm without Lipschitz continuity for large-scale nonlinear pseudo-monotone equations. J. Comput. Appl. Math. 403, 113822 (2022)
  • (12) Ibrahim, A.H., Kumam, P., Rapajić, S., Papp, Z., Abubakar, A.B.: Approximation methods with inertial term for large-scale nonlinear monotone equations. Appl. Numer. Math. 181, 417-435 (2022)
  • (13) Wu, X.Y., Shao, H., Liu, P.J., Zhuo, Y.: An inertial spectral CG projection method based on the memoryless BFGS update. J. Optim. Theory Appl. 198, 1130-1155 (2023)
  • (14) Yin, J.H., Jian, J.B., Jiang, X.Z., Liu, M.X., Wang, L.Z.: A hybrid three-term conjugate gradient projection method for constrained nonlinear monotone equations with applications. Numer. Algorithms 88(1), 389-418 (2021)
  • (15) Waziri, M.Y., Ahmed, K.: Two descent Dai-Yuan conjugate gradient methods for systems of monotone nonlinear equations. J. Sci. Comput. 90, 1-53 (2022)
  • (16) Ma, G.D., Jin, J.C., Jian, J.B., Yin, J.H., Han, D.L.: A modified inertial three-term conjugate gradient projection method for constrained nonlinear equations with applications in compressed sensing. Numer. Algorithms 92, 1621-1653 (2023)
  • (17) Zhang, L., Zhou, W.J.: Spectral gradient projection method for solving nonlinear monotone equations. J. Comput. Appl. Math. 196, 478-484 (2006)
  • (18) Ou, Y.G., Li, J.Y.: A new derivative-free SCG-type projection method for nonlinear monotone equations with convex constraints. J. Appl. Math. Comput. 56, 195-216 (2018)
  • (19) Amini, K., Kamandi, A.: A new line search strategy for finding separating hyperplane in projection-based methods. Numer. Algorithms 70, 559-570 (2015)
  • (20) Li, Q.N., Li, D.H.: A class of derivative-free methods for large-scale nonlinear monotone equations. IMA J. Numer. Anal. 31(4), 1625-1635 (2011)
  • (21) Ou, Y.G., Li, L.: A unified convergence analysis of the derivative-free projection-based method for constrained nonlinear monotone equations. Numer. Algorithms 93(4), 1639-1660 (2023)
  • (22) Goncalves, M.L.N., Menezes, T.C.: A framework for convex-constrained monotone nonlinear equations and its special cases. Comp. Appl. Math. 42, 306 (2023)
  • (23) Anderson, D.G.: Iterative procedures for nonlinear integral equations. J. Assoc. Comput. Mach. 12, 547-560 (1965)
  • (24) Alvarez, F.: On the minimizing property of a second order dissipative system in Hilbert spaces. SIAM J. Control Optim. 38(4), 1102-1119 (2000)
  • (25) Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k2)O(1/k^{2}). Soviet Math. Dokl. 27, 372-376 (1983)
  • (26) Walker, H.F., Ni, P.: Anderson acceleration for fixed-point iterations. SIAM J. Numer. Anal. 49(4), 1715-1735 (2011)
  • (27) Saad, Y., Schultz, M.H.: GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7(3), 856-869 (1986)
  • (28) Potra, F.A., Engler, H.: A characterization of the behavior of the Anderson acceleration on linear problems. Linear Algebra Appl. 438(3), 1002-1011 (2013)
  • (29) Wang, D.W., He, Y.H., Sterck, H.D.: On the asymptotic linear convergence speed of Anderson acceleration applied to ADMM. J. Sci. Comput. 88(2), 38 (2021)
  • (30) Fang, H.R., Saad, Y.: Two classes of multisecant methods for nonlinear acceleration. Numer. Linear Algebra Appl. 16(3), 197-221 (2009)
  • (31) Wang, S.Y., Chen, W.Y., Huang, L.W., Zhang, F., Zhao, Z.T., Qu, H.: Regularization-adapted Anderson acceleration for multi-agent reinforcement learning. Knowl.-Based Syst. 275, 110709 (2023)
  • (32) Pollock, S., Rebholz, L.G., Xiao, M.Y.: Anderson-accelerated convergence of Picard iterations for incompressible Navier-Stokes equations. SIAM J. Numer. Anal. 57, 615-637 (2019)
  • (33) Yang, Y.N.: Anderson acceleration for seismic inversion. Geophysics 86(1), R99-R108 (2021)
  • (34) Toth, A., Kelley, C.: Convergence analysis for Anderson acceleration. SIAM J. Numer. Anal. 53, 805-819 (2015)
  • (35) Mai, V., Johansson, M.: Anderson acceleration of proximal gradient methods. In Int. Conf. Mach. Learn., PMLR, 6620-6629 (2020)
  • (36) Bian, W., Chen, X.J.: Anderson acceleration for nonsmooth fixed point problems. SIAM J. Numer. Anal. 60(5), 2565-2591 (2022)
  • (37) Rebholz, L.G., Xiao, M.Y.: The effect of Anderson acceleration on superlinear and sublinear convergence. J. Sci. Comput. 96(2), 34 (2023)
  • (38) Garner, C., Lerman, G., Zhang, T.: Improved convergence rates of Anderson acceleration for a large class of fixed-point iterations. arXiv preprint arXiv: 2311.02490 (2023).
  • (39) Cai, X.J., Gu, G.Y., He, B.S.: On the O(1/t)O(1/t) convergence rate of the projection and contraction methods for variational inequalities with Lipschitz continuous monotone operators. Comput. Optim. Appl. 57(2), 339-363 (2014)
  • (40) Polyak, B.T.: Introduction to Optimization, Optimization Software, Inc. Publications Division, New York (1987).
  • (41) Scieur, D., d’Aspremont, A., Bach, F.: Regularized nonlinear acceleration. In Adv. Neural Inf. Process. Syst. 712-720 (2016)
  • (42) Ouyang, W.Q., Tao, J., Milzarek, A., Deng, B.L.: Nonmonotone globalization for Anderson Acceleration via adaptive regularization. J. Sci. Comput. 96, 5 (2023)
  • (43) Henderson, N.C., Varadhan, R.: Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms. J. Comput. Graph. Stat. 28, 834-846 (2019)
  • (44) Ouyang, W.Q., Peng, Y., Yao, Y.X., Zhang, J.Y., Deng, B.L.: Anderson acceleration for nonconvex ADMM based on Douglas-Rachford splitting. Comput. Graph. Forum 39, 221-239 (2020)
  • (45) Zhang, J.Y., O’Donoghue, B., Boyd, S.: Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM J. Optim. 30(4), 3170-3197 (2020)
  • (46) Anderson, D.G.M.: Comments on “Anderson Acceleration, Mixing and Extrapolation”. Numer. Algorithms 80, 135-234 (2019)
  • (47) Chen, C.H., Ma, S.Q., Yang, J.F.: A general inertial proximal point algorithm for mixed variational inequality problem. SIAM J. Optim. 25, 2120-2142 (2014)
  • (48) Amini, K., Faramarzi, P.: Global convergence of a modified spectral three-term CG algorithm for nonconvex unconstrained optimization problems. J. Comput. Appl. Math. 417, 114630 (2023)
  • (49) Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201-213 (2002)
  • (50) Luo, H.: Accelerated primal-dual methods for linearly constrained convex optimization problems. arXiv preprint arXiv: 2109.12604 (2021).
  • (51) Golub, G.H., Van Loan, C.: Matrix Computations, 3rd ed., The Johns Hopkins University Press, Baltimore, MD (1996)
  • (52) Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1-27 (2011)