
[1]\fnmGonglin \surYuan \equalcontThese authors contributed equally to this work.

[1]\orgdivGuangxi University, \orgnameSchool of Mathematics and Information Science, Center for Applied Mathematics of Guangxi (Guangxi University), \orgaddress\streetStreet, \cityNanning, \postcode530004, \stateGuangxi, \countryChina

[2]\orgnameThai Nguyen University of Economics and Business Administration, \orgaddress\streetStreet, \stateThai Nguyen, \countryVietnam

A Diagonal BFGS Update Algorithm with Inertia Acceleration Technology for Minimization

\fnmZhenhua \surLuo [email protected]    [email protected]    \fnmHongtruong \surPham [email protected]
Abstract

We integrate the diagonal quasi-Newton update approach with the enhanced BFGS formula proposed by Wei, Z., Yu, G., Yuan, G., Lian, Z. [1], and incorporate an extrapolation (inertial) acceleration technique. The method, designed for non-convex unconstrained problems, requires the search direction to ensure sufficient descent, and global linear convergence is established. This design has yielded exceptionally favorable numerical results.

keywords:
Non-convex Unconstrained Optimization, q-linear convergence, Quasi-Newton Methods, Diagonal Quasi-Newton Updates, Inertia Algorithms.

1 Introduction

Consider the unconstrained optimization problem of minimizing a continuously differentiable function f:\mathbb{R}^{n}\rightarrow\mathbb{R}:

\min_{x\in\mathbb{R}^{n}}f(x). (1.1)

Quasi-Newton methods are a class of iterative optimization algorithms that employ approximations of the Hessian matrix to solve problem (1.1). The underlying idea is to generate a sequence of iterates \{x_{k}\} that converges to a stationary point of f, while iteratively refining an approximate Hessian matrix B_{k} to accelerate the optimization of f at each iteration. The quasi-Newton approach originated with Davidon [2], and subsequent enhancements and applications have been explored by numerous scholars [3, 4, 5]. Starting from x_{0}\in\mathbb{R}^{n} and B_{0}\in\mathbb{R}^{n\times n}, the method generates the sequence \{x_{k}\} while iteratively refining the estimate of the Hessian matrix:

x_{k+1}=x_{k}+\alpha_{k}d_{k}, (1.2)

where d_{k}\in\mathbb{R}^{n} is the search direction and the step size \alpha_{k}>0 is determined by a line search procedure. Throughout, g_{k} denotes the gradient of f at x_{k}. The search direction is obtained by solving

B_{k}d_{k}=-g_{k}, (1.3)

Here s_{k}=x_{k+1}-x_{k} denotes the difference between consecutive iterates and y_{k}=g_{k+1}-g_{k} denotes the difference between the corresponding gradients. In algorithms that approximate the Hessian matrix, the vectors s_{k} and y_{k} are crucial to the update process. The BFGS update is the most widely used of these formulations, expressed as:

B_{k+1}=B_{k}-\frac{B_{k}s_{k}s_{k}^{T}B_{k}}{s_{k}^{T}B_{k}s_{k}}+\frac{y_{k}y_{k}^{T}}{y_{k}^{T}s_{k}}. (1.4)
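For reference, the classical update (1.4) can be sketched in a few lines of Python/NumPy. This is only an illustrative sketch under our own naming; the curvature safeguard that skips the update when y_{k}^{T}s_{k} is too small is a common implementation choice, not part of formula (1.4).

import numpy as np

def bfgs_update(B, s, y, curvature_tol=1e-12):
    # One BFGS update of the Hessian approximation B, cf. (1.4).
    Bs = B @ s
    sBs = s @ Bs
    ys = y @ s
    if ys <= curvature_tol or sBs <= curvature_tol:
        return B          # skip the update if the curvature condition fails
    return B - np.outer(Bs, Bs) / sBs + np.outer(y, y) / ys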

Computational evidence has consistently shown the BFGS update to be more effective than other quasi-Newton formulas. Considerable research has been devoted to the convergence properties of the technique, particularly for convex minimization, as evidenced by studies such as [6, 7, 8, 3, 4, 5]. These formulations have proved highly effective across a diverse range of practical optimization problems. Building on the insights from these studies, we refine the matrix B_{k} in (1.4), leading to enhanced convergence performance.

Drawing inspiration from Neculai Andrei's work on diagonal approximations of the Hessian of a function f [9], we leverage the efficacy of the update formula (1.4) and its variants. The approach determines only the entries on the principal diagonal of the approximating matrix, following the minimal-change secant updating theory of Dennis and Wolkowicz [10]. Specifically, we consider

B_{k}=\text{diag}(b_{k}^{1},\ldots,b_{k}^{n}), (1.5)

where b_{k}^{i}>0 for every i=1,\ldots,n, as an estimate of the Hessian of f at iteration k. Within the algorithm, the search direction is computed as d_{k}=-B_{k}^{-1}g_{k}, i.e., d_{k}^{i}=-{g_{k}^{i}}/{b_{k}^{i}} for i=1,\ldots,n. Since B_{k} is both diagonal and positive definite, each gradient component is scaled by a distinct positive factor determined by the diagonal of B_{k}. To apply the minimal-change secant update strategy, assume that the diagonal matrix B_{k} is positive definite. The updated matrix B_{k+1}, obtained from B_{k}, can then be expressed as:

B_{k+1}=B_{k}+\Delta_{k}, (1.6)

where \Delta_{k}=\text{diag}(\delta_{k}^{1},\cdots,\delta_{k}^{n}) is a correction matrix chosen so that B_{k+1} remains accurate and stable. More precisely, \Delta_{k} corrects B_{k} so that B_{k+1} is positive definite and satisfies the weak secant condition proposed by Dennis and Wolkowicz [10]:

s_{k}^{T}B_{k+1}s_{k}=s_{k}^{T}y_{k}.

Our method utilizes the update (1.11), introduced below, together with the weak Wolfe–Powell (WWP) step-size rule:

f(x_{k}+\alpha_{k}d_{k}) \leq f(x_{k})+\rho\alpha_{k}g_{k}^{T}d_{k}, (1.7)
g_{k+1}^{T}d_{k} \geq \sigma g_{k}^{T}d_{k}, (1.8)
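A simple bracketing/bisection realization of the WWP conditions (1.7)-(1.8) is sketched below in Python/NumPy. The bracketing strategy and the iteration cap are our own implementation assumptions; any line search returning a step that satisfies (1.7)-(1.8) may be used instead. The default parameters mirror the values \sigma=0.8 and \rho=0.0001 used in the experiments of Section 4.

import numpy as np

def wwp_line_search(f, grad, x, d, rho=1e-4, sigma=0.8, alpha0=1.0, max_iter=50):
    # Return a step size satisfying (1.7)-(1.8) via a standard bracketing/bisection scheme.
    lo, hi = 0.0, np.inf
    alpha = alpha0
    fx, gTd = f(x), grad(x) @ d
    for _ in range(max_iter):
        if f(x + alpha * d) > fx + rho * alpha * gTd:      # (1.7) violated: shrink
            hi = alpha
            alpha = 0.5 * (lo + hi)
        elif grad(x + alpha * d) @ d < sigma * gTd:        # (1.8) violated: enlarge
            lo = alpha
            alpha = 2.0 * alpha if hi == np.inf else 0.5 * (lo + hi)
        else:
            return alpha
    return alpha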

Meanwhile, Wei, Z., Yu, G., Yuan, G., and Lian, Z. introduced a novel quasi-Newton condition [1]:

B_{k+1}s_{k}=y_{k}^{*}, (1.9)

where y^{*}_{k}=y_{k}+Y_{k}(3)s_{k} and

Y_{k}(3)=\frac{2[f(x_{k})-f(x_{k+1})]+(g(x_{k+1})+g(x_{k}))^{T}s_{k}}{\|s_{k}\|^{2}}I. (1.10)

By using (1.9), they provided a BFGS-type update,

B_{k+1}=B_{k}-\frac{B_{k}s_{k}s_{k}^{T}B_{k}}{s_{k}^{T}B_{k}s_{k}}+\frac{y_{k}^{*}(y_{k}^{*})^{T}}{s_{k}^{T}y_{k}^{*}}. (1.11)
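The quantities in (1.9)-(1.11) are straightforward to compute. The following sketch (our own notation) returns the scalar that multiplies the identity in (1.10) together with the modified vector y_{k}^{*}; it assumes the objective f and its gradient g are available as callables.

import numpy as np

def modified_difference(f, g, x_k, x_k1):
    # Scalar of Y_k(3) in (1.10) and y_k^* = y_k + Y_k(3) s_k from (1.9).
    s = x_k1 - x_k
    y = g(x_k1) - g(x_k)
    yk3 = (2.0 * (f(x_k) - f(x_k1)) + (g(x_k1) + g(x_k)) @ s) / (s @ s)
    y_star = y + yk3 * s          # Y_k(3) = yk3 * I, hence Y_k(3) s_k = yk3 * s_k
    return yk3, y_star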

Therefore, our attention shifts to the modified BFGS method of Wei, Yu, Yuan, and Lian [1], a formulation shown to exhibit superlinear convergence. They introduced several modifications to the BFGS method based on the new quasi-Newton condition B_{k+1}s_{k}=y_{k}^{*}. Employing this property within our framework enables us to establish linear convergence.

This observation motivates us to combine these two derivations of B_{k} and obtain a new, improved update formula, which also simplifies the convergence proof. In the iterative update we assume that B_{k} is positive definite, and B_{k+1} is derived from B_{k} through the condition B_{k+1}s_{k}=y_{k}^{*}, where y_{k}^{*} is a formula approximating \nabla^{2}f(x)s_{k}. In practice, \alpha_{k} in (1.2) is determined by the WWP rule [11, 12] with positive constants \sigma and \rho satisfying 1>\sigma>\rho>0.

Section 1 of this paper introduces the research problem and recent advancements in related studies. Section 2 explains the motivation behind this research and proposes the improved diagonal quasi-Newton algorithm, DMBFGS3, as well as the WDMBFGS3 algorithm, which incorporates extrapolation techniques. Section 3 proves the global linear convergence of both algorithms. Section 4 validates the rapid convergence and stability of the WDMBFGS3 algorithm through numerical experiments. Section 5 wraps up the findings discussed in this paper and proposes avenues for further investigation.

2 Motivation and algorithm: Optimizing Hessian with Diagonal Minimization

As previously mentioned, the matrix \Delta_{k} in (1.6) is determined as the solution of the following problem:

\min \frac{1}{2}\|\Delta_{k}\|_{F}^{2}+\mathrm{tr}(B_{k}+\Delta_{k}), (2.1)
s.t. \quad s_{k}^{T}B_{k+1}s_{k}=s_{k}^{T}y_{k}.

Replacing y_{k} by the modified vector y_{k}^{*} in the constraint, we obtain the following problem:

\min \frac{1}{2}\|\Delta_{k}\|_{F}^{2}+\mathrm{tr}(B_{k}+\Delta_{k}) (2.2)
s.t. \quad s_{k}^{T}B_{k+1}s_{k}=s_{k}^{T}y_{k}^{*}.

The objective of problem (2.2) combines two goals: minimizing the change from B_{k} to B_{k+1}, in accordance with the variational principle, and minimizing the sum of the diagonal elements of B_{k+1}. The term \text{tr}(B_{k}+\Delta_{k}) in the objective of (2.1) is included with the aim of deriving a correction matrix \Delta_{k} with small diagonal entries. Componentwise, problem (2.2) reads:

\min_{\Delta_{k}} \frac{1}{2}\sum_{i=1}^{n}(\delta_{k}^{i})^{2}+\sum_{i=1}^{n}(b_{k}^{i}+\delta_{k}^{i}) (2.3)
s.t. \quad \sum_{i=1}^{n}(s_{k}^{i})^{2}\delta_{k}^{i}=s_{k}^{T}y_{k}+s_{k}^{T}Y_{k}(3)s_{k}-\sum_{i=1}^{n}(s_{k}^{i})^{2}b_{k}^{i},

where \Delta_{k}=\operatorname{diag}(\delta_{k}^{1},\dots,\delta_{k}^{n}), s_{k}=(s_{k}^{1},\dots,s_{k}^{n})^{T}, and y_{k}=\nabla^{2}f(x_{k})s_{k}. The Lagrangian of problem (2.3) is:

L=\frac{1}{2}\sum_{i=1}^{n}(\delta_{k}^{i})^{2}+\sum_{i=1}^{n}(b_{k}^{i}+\delta_{k}^{i})+\lambda\left(\sum_{i=1}^{n}(s_{k}^{i})^{2}\delta_{k}^{i}-s_{k}^{T}y_{k}-s_{k}^{T}Y_{k}(3)s_{k}+\sum_{i=1}^{n}(s_{k}^{i})^{2}b_{k}^{i}\right), (2.4)

where \lambda is the Lagrange multiplier. Solutions of problem (2.3) correspond to stationary points of the Lagrangian. Setting the partial derivative of (2.4) with respect to \delta_{k}^{i} to zero gives:

\delta_{k}^{i}+1+\lambda(s_{k}^{i})^{2}=0,\quad i=1,\dots,n, (2.5)

Rearranging terms yields:

\delta_{k}^{i}=-1-\lambda(s_{k}^{i})^{2},\quad i=1,\dots,n. (2.6)

To determine \lambda, we use the constraint in (2.3) and substitute (2.6) into it, which gives:

\lambda=\frac{\sum_{i=1}^{n}(s_{k}^{i})^{2}b_{k}^{i}-s_{k}^{T}y_{k}-s_{k}^{T}Y_{k}(3)s_{k}-\sum_{i=1}^{n}(s_{k}^{i})^{2}}{\operatorname{tr}(A_{k}^{2})}, (2.7)

where A_{k}=\operatorname{diag}((s_{k}^{1})^{2},\dots,(s_{k}^{n})^{2}). Substituting (2.7) into (2.6), the diagonal elements of \Delta_{k} can be expressed as follows:

\delta_{k}^{i}=\frac{s_{k}^{T}y_{k}+s_{k}^{T}Y_{k}(3)s_{k}+s_{k}^{T}s_{k}-s_{k}^{T}B_{k}s_{k}}{\operatorname{tr}(A_{k}^{2})}(s_{k}^{i})^{2}-1,\quad i=1,\dots,n, (2.8)

where B_{k} serves as the diagonal approximation of the Hessian matrix. The diagonal correction matrix \Delta_{k} can therefore be written as

\Delta_{k}=\left(\frac{s_{k}^{T}y_{k}+s_{k}^{T}Y_{k}(3)s_{k}+s_{k}^{T}s_{k}-s_{k}^{T}B_{k}s_{k}}{\mathrm{tr}(A_{k}^{2})}\right)A_{k}-I. (2.9)

Then we have the following equation:

B_{k+1}=B_{k}+\left(\frac{s_{k}^{T}y_{k}+s_{k}^{T}Y_{k}(3)s_{k}+s_{k}^{T}s_{k}-s_{k}^{T}B_{k}s_{k}}{\mathrm{tr}(A_{k}^{2})}\right)A_{k}-I, (2.10)

or, componentwise,

b_{k+1}^{i}=b_{k}^{i}+\frac{s_{k}^{T}y_{k}+s_{k}^{T}Y_{k}(3)s_{k}+s_{k}^{T}s_{k}-s_{k}^{T}B_{k}s_{k}}{\operatorname{tr}(A_{k}^{2})}(s_{k}^{i})^{2}-1,\quad i=1,\dots,n. (2.11)

The search direction is subsequently computed as:

d_{k+1}=-B_{k+1}^{-1}g_{k+1}, (2.12)

where g_{k+1} is the gradient at the new point and d_{k+1} is the search direction. It is essential to note that, to prevent the updated matrix B_{k+1} from becoming singular or indefinite, it is necessary that b_{k+1}^{i}=b_{k}^{i}+\delta_{k}^{i}>0 holds for all i=1,\ldots,n. In practice, if b_{k+1}^{i}\geq\epsilon_{b}, then d_{k+1}^{i}=-\frac{g_{k+1}^{i}}{b_{k+1}^{i}}; otherwise, d_{k+1}^{i}=-g_{k+1}^{i}, where \epsilon_{b} denotes a small positive constant.
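The complete diagonal update (2.10)-(2.11) and the safeguarded direction can be collected as follows. This is a sketch under our own naming, with an illustrative default for \epsilon_{b}; the small assertion at the end merely illustrates numerically that the update reproduces the weak secant condition s_{k}^{T}B_{k+1}s_{k}=s_{k}^{T}y_{k}^{*} used in (2.2).

import numpy as np

def diagonal_update(b, s, y_star):
    # Update the diagonal b of B_k according to (2.10)-(2.11).
    trA2 = np.sum(s**4)                        # tr(A_k^2)
    coeff = (s @ y_star + s @ s - np.sum(b * s**2)) / trA2
    return b + coeff * s**2 - 1.0              # componentwise form (2.11)

def safeguarded_direction(b_new, g_new, eps_b=1e-8):
    # Safeguarded direction described above: -g/b componentwise where b is safely positive.
    return np.where(b_new >= eps_b, -g_new / b_new, -g_new)

# Illustration: the updated diagonal satisfies the weak secant condition.
rng = np.random.default_rng(0)
b, s, y_star = rng.random(5) + 1.0, rng.random(5) + 0.1, rng.random(5)
b_new = diagonal_update(b, s, y_star)
assert np.isclose(s @ (b_new * s), s @ y_star)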

In practice, the pursuit of iterative algorithms with rapid convergence rates remains a consistent focus of research [13, 14, 15, 16, 17, 18]. Among acceleration approaches, the inertial extrapolation method has received significant attention [19, 20]. These methods rely on iterative schemes in which each new iterate is built from the two preceding ones. The concept originates from Polyak's seminal work [21] and is rooted in the implicit discretization of a second-order-in-time dissipative dynamical system, described by the following differential equation:

x′′(k)+ηx(k)+f(x(k))=0,x^{\prime\prime}(k)+\eta x^{\prime}(k)+\nabla f(x(k))=0, (2.13)

Let \eta>0 and let f:\mathbb{R}^{n}\rightarrow\mathbb{R} be a differentiable function. Discretizing system (2.13) allows the next iterate x_{k+1} to be determined from the two previous iterates x_{k-1} and x_{k}. This discretization is central to inertial extrapolation methods, which aim to improve the convergence rate of iterative algorithms by exploiting information from the two preceding iterates. The specific form of the discretization and the resulting iterative scheme depend on the particular inertial method, but the underlying principle is the same: harnessing the momentum generated by previous iterations to expedite convergence.
System (2.13) can be discretized so that, given the terms x_{k-1} and x_{k}, the subsequent term x_{k+1} is obtained from

\frac{x_{k+1}-2x_{k}+x_{k-1}}{q^{2}}+\eta\frac{x_{k}-x_{k-1}}{q}+\nabla f(x_{k})=0,\quad k\geq 1, (2.14)

where q is the step size. Equation (2.14) yields the following iterative scheme:

x_{k+1}=x_{k}+\rho(x_{k}-x_{k-1})-\alpha\nabla f(x_{k}),\quad k\geq 1, (2.15)

where \rho=1-\eta q and \alpha=q^{2}; the term \rho(x_{k}-x_{k-1}) is called the inertial extrapolation term and is intended to accelerate the convergence of the sequence generated by (2.15).
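For intuition, the heavy-ball iteration (2.15) can be written in a few lines. The quadratic test function and the parameter values below are our own toy choices for illustration only.

import numpy as np

def heavy_ball(grad, x0, rho=0.9, alpha=0.01, iters=500):
    # Inertial iteration (2.15): x_{k+1} = x_k + rho (x_k - x_{k-1}) - alpha grad(x_k).
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x_next = x + rho * (x - x_prev) - alpha * grad(x)
        x_prev, x = x, x_next
    return x

# Toy usage on f(x) = 0.5 ||x||^2, whose gradient is x; the iterates approach 0.
x_min = heavy_ball(lambda x: x, np.ones(3))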

Recently, many algorithms based on the inertial acceleration technique, initially proposed by Alvarez [22], have emerged to tackle a diverse array of optimization and engineering problems. Notably, the combination of inertial acceleration with carefully designed search-direction strategies, as advocated by Sun and Liu [23], has attracted significant attention. Building on this groundwork, [24] made significant contributions to solving nonlinear monotone equations, where convergence requires boundedness and the following condition:

\sum_{k=1}^{\infty}\alpha_{k}\|x_{k}-x_{k-1}\|<\infty, (2.16)

The inertial extrapolation step magnitude is selected as:

\alpha_{k}:=\min\left\{\alpha,\frac{1}{k^{2}\|x_{k}-x_{k-1}\|}\right\}, (2.17)

In this case, with \alpha\in[0,1), the condition (2.16) is satisfied automatically. As is well known, this condition is stronger than the one typically used:

\sum_{k=1}^{\infty}\alpha_{k}\|x_{k}-x_{k-1}\|^{2}<\infty, (2.18)

For several algorithms obtained through inertial extrapolation techniques [25, 20, 26, 19], condition (2.18) is instead guaranteed by setting

\alpha_{k}:=\min\left\{\alpha,\frac{1}{(k\|x_{k}-x_{k-1}\|)^{2}}\right\}, (2.19)

Comparing these two choices, we observe that as the iterations proceed, the discrepancy between x_{k-1} and x_{k} becomes smaller and smaller. Therefore, the step size (2.17) associated with condition (2.16) is expected to achieve better convergence than the step size (2.19) associated with (2.18).

Based on the preceding derivation, we propose the following algorithm:

Algorithm 1 The DMBFGS3 Algorithm
1:Select x_{1}\in\mathbb{R}^{n}, a sufficiently small \epsilon_{g}>0, \epsilon_{b}>0, and B_{1}\in\mathbb{R}^{n\times n}. Set k=1.
2:If \|g_{k}\|\leq\epsilon_{g}, stop.
3:Solve B_{k}d_{k}+g_{k}=0 to find d_{k}.
4:Apply the WWP rule (1.7)-(1.8) to determine \alpha_{k}.
5:Update x_{k+1}=x_{k}+\alpha_{k}d_{k}. Calculate Y_{k}(3) from (1.10), and obtain y^{*}_{k}=y_{k}+Y_{k}(3)s_{k}.
6:Compute the diagonal elements b_{k+1}^{i}, i=1,\ldots,n, of the Hessian approximation B_{k+1} as detailed in (2.11).
7:Calculate the search direction d_{k+1}: if b_{k+1}^{i}\geq\epsilon_{b}, then d_{k+1}^{i}=-\frac{g_{k+1}^{i}}{b_{k+1}^{i}}; otherwise, set d_{k+1}^{i}=-g_{k+1}^{i}.
8:Increment k by 1 and proceed with Step 2.
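A compact driver for Algorithm 1 can be assembled from the pieces sketched earlier. The helpers wwp_line_search, modified_difference, diagonal_update and safeguarded_direction are the hypothetical functions from the sketches above, so the block below only illustrates the flow of Steps 1-8 and is not the reference implementation used in the experiments.

import numpy as np

def dmbfgs3(f, grad, x1, eps_g=1e-6, eps_b=1e-8, max_iter=5000):
    # Sketch of Algorithm 1 (DMBFGS3) with B_k stored as its diagonal.
    x, b = x1.copy(), np.ones(x1.size)              # B_1 = I
    d = -grad(x)                                    # solves B_1 d_1 = -g_1
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:              # Step 2: stopping test
            break
        alpha = wwp_line_search(f, grad, x, d)      # Step 4: WWP step size
        x_new = x + alpha * d                       # Step 5
        s = x_new - x
        _, y_star = modified_difference(f, grad, x, x_new)
        b = diagonal_update(b, s, y_star)           # Step 6: update (2.11)
        d = safeguarded_direction(b, grad(x_new), eps_b)  # Step 7
        x = x_new
    return x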

We further accelerate the convergence of the DMBFGS3 algorithm by means of the extrapolation technique described above. The resulting algorithm is as follows:

Algorithm 2 The WDMBFGS3 Algorithm
1:Select x_{1}\in\mathbb{R}^{n}, a sufficiently small \epsilon_{g}>0, \epsilon_{b}>0, and B_{1}\in\mathbb{R}^{n\times n}. Set k=1.
2:If \|g_{k}\|\leq\epsilon_{g}, stop. Compute
x_{k+1}=x_{k}+\alpha_{k}d_{k},
and choose 0\leq\tau_{k}\leq\tilde{\tau_{k}} with
\tilde{\tau_{k}}:=\begin{cases}\min\{\frac{1}{k^{2}\|x_{k}-x_{k-1}\|^{2}},\tau\},&\text{if }x_{k}\neq x_{k-1},\\ \tau,&\text{otherwise.}\end{cases}
3:Compute \nabla f(p_{k}). If \|\nabla f(p_{k})\|\leq\text{Tol}, stop. Otherwise, solve B_{k}d_{k}+g_{k}=0 to find d_{k}.
4:Apply the WWP rule (1.7)-(1.8) to determine \alpha_{k}.
5:Compute p_{k}=x_{k}+\tau_{k}(x_{k}-x_{k-1}) and s_{k}=p_{k}-p_{k-1}. Calculate Y_{k}(3) from (1.10), and obtain y^{*}_{k}=y_{k}+Y_{k}(3)s_{k}.
6:Compute the diagonal elements b_{k+1}^{i}, i=1,\ldots,n, of the Hessian approximation B_{k+1} as detailed in (2.11).
7:Calculate the search direction d_{k+1}: if b_{k+1}^{i}\geq\epsilon_{b}, then d_{k+1}^{i}=-\frac{g_{k+1}^{i}}{b_{k+1}^{i}}; otherwise, set d_{k+1}^{i}=-g_{k+1}^{i}.
8:Increment k by 1 and proceed with Step 2.
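The only new ingredients of Algorithm 2 are the inertial point p_{k} and the bound \tilde{\tau}_{k} of Step 2. A minimal sketch of that step is given below; the value \tau=0.9 is an illustrative assumption, and any 0\leq\tau_{k}\leq\tilde{\tau}_{k} is admissible.

import numpy as np

def inertial_point(x, x_prev, k, tau=0.9):
    # Steps 2 and 5 of Algorithm 2: compute tilde(tau_k), pick tau_k, and form p_k.
    diff = np.linalg.norm(x - x_prev)
    tau_tilde = min(1.0 / (k**2 * diff**2), tau) if diff > 0 else tau
    tau_k = tau_tilde                              # simplest admissible choice
    p = x + tau_k * (x - x_prev)
    return p, tau_k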

Remark

  • We introduce a novel approach based on the theory of minimal alterations in secant updates proposed by Dennis and Wolkowicz [10]. In our method, we enforce the condition that the updating matrices are diagonal and serve as approximations to the Hessian. We have improved the choice of y_{k} and achieved linear convergence for the entire algorithm.

  • Our experimental results demonstrate that the algorithm proposed by combining these two methods is more efficient and robust compared to other benchmark algorithms.

3 Convergence of the Algorithms

The following assumptions are needed throughout this article:

  • The function f is continuously differentiable on the set \Omega, with \|g(m)-g(n)\|\leq L\|m-n\| for all m,n\in\Omega, where L is a non-negative constant.

  • The level set X=\{t\mid f(t)\leq f(t_{0})\} is contained in the set \Omega.

  • There exist constants v_{1},v_{2}>0 such that the function f is uniformly convex, i.e., v_{1}\|t\|^{2}\leq t^{T}G(x)t\leq v_{2}\|t\|^{2} for all t\in\mathbb{R}^{n}, where G(x) denotes the Hessian of f.

  • There exists x^{*}\in\Omega that solves the minimization problem (1.1).

The following Lemma 1 and Lemma 2, presented in [1, 9], are reproduced here for completeness.

Lemma 1.

Assume that \|s_{k}\|\neq 0 for each k and that the function f satisfies the required assumptions, and let the sequences of diagonal matrices \{\Delta_{k}\} and \{B_{k}\} be generated by (2.9) and (2.10).

Assuming the existence of constants \beta_{0} and \gamma_{0} satisfying \beta_{0}\leq b_{0}^{i}\leq\gamma_{0} for every i=1,\ldots,n, where b_{0}^{i} is the i-th diagonal element of the matrix B_{0}, it follows that the sequences \{\Delta_{k}\} and \{B_{k}\} remain finite for each k.

Proof.

By Equations (2.9) and (2.10), since B_{k} is a diagonal matrix, we can derive the iterative formula (2.11) for its diagonal elements, with y_{0}=\nabla^{2}f(\xi_{0})s_{0} and s_{k}=x_{k+1}-x_{k}. Because \left(s_{0}^{\max}\right)^{4}/tr\left(A_{0}^{2}\right)\leq 1, using Equation (2.11) we have:

\|\delta_{0}^{i}\| =\left\|\frac{s_{0}^{T}y_{0}+s_{0}^{T}s_{0}-s_{0}^{T}B_{0}s_{0}}{\operatorname{tr}(A_{0}^{2})}(s_{0}^{i})^{2}-1\right\| (3.1)
\leq\frac{\|s_{0}^{T}y_{0}+s_{0}^{T}s_{0}-s_{0}^{T}B_{0}s_{0}\|}{\operatorname{tr}(A_{0}^{2})}(s_{0}^{i})^{2}+1
\leq\frac{\|s_{0}^{T}\nabla^{2}f(\xi_{0})s_{0}+s_{0}^{T}s_{0}-s_{0}^{T}B_{0}s_{0}\|}{\operatorname{tr}(A_{0}^{2})}(s_{0}^{\max})^{2}+1
=\frac{\|s_{0}^{T}\nabla^{2}f(\xi_{0})s_{0}+s_{0}^{T}s_{0}-s_{0}^{T}B_{0}s_{0}\|}{(s_{0}^{\max})^{2}\operatorname{tr}(A_{0}^{2})}(s_{0}^{\max})^{4}+1
\leq\frac{\|s_{0}^{T}\nabla^{2}f(\xi_{0})s_{0}+s_{0}^{T}s_{0}-s_{0}^{T}B_{0}s_{0}\|}{(s_{0}^{\max})^{2}}+1.

Since \|s_{k}\|\neq 0, there exist \varsigma>0 and \varepsilon>0 such that \varsigma\geq s_{0}^{\mathrm{max}} and \|s_{k}\|\geq\varepsilon. Therefore,

\|s_{0}\|^{2}=s_{0}^{T}s_{0}=\sum_{i=1}^{n}\left(s_{0}^{i}\right)^{2}\leq\sum_{i=1}^{n}\left(s_{0}^{\max}\right)^{2}=n\left(s_{0}^{\max}\right)^{2}.

Since \xi_{0}\in\Omega, using the assumptions we can deduce that

\|s_{0}^{T}\nabla^{2}f(\xi_{0})s_{0}\|\leq nK(s_{0}^{\mathrm{max}})^{2}\leq nK\varsigma^{2},\qquad\beta_{0}\|s_{0}\|^{2}\leq s_{0}^{T}B_{0}s_{0}\leq\gamma_{0}\|s_{0}\|^{2}.

Set \Gamma=\max\{\|\beta_{0}\|,\|\gamma_{0}\|\}; then

\|\delta_{0}^{i}\| \leq\frac{\|s_{0}^{T}\nabla^{2}f(\xi_{0})s_{0}\|+\|s_{0}^{T}s_{0}\|+\|s_{0}^{T}B_{0}s_{0}\|}{(s_{0}^{\max})^{2}}+1
\leq\frac{nK\varsigma^{2}+n\varsigma^{2}+n\Gamma\varsigma^{2}}{(s_{0}^{\max})^{2}}+1
\leq\frac{(K+1+\Gamma)n^{2}\varsigma^{2}}{\varepsilon^{2}}+1\equiv c,

for each i=1,\ldots,n. Using b_{k+1}^{i}=b_{k}^{i}+\delta_{k}^{i}, we can deduce that

\beta_{k}\leq b_{k}^{i}\leq\gamma_{k},\quad i=1,\ldots,n,

where \beta_{k}=\beta_{0}-(k-1)c and \gamma_{k}=\gamma_{0}+(k-1)c. The diagonal matrix sequence \{\Delta_{k}\} is therefore bounded and, because of Equation (1.6), the sequence \{B_{k}\} is bounded as well. ∎

Lemma 2.

Suppose that the iterates are produced by the algorithm DMBFGS3. Then, for any k, the function value f(x_{k}) can be represented as:

f(x_{k})=f(x_{k+1})+g(x_{k+1})^{T}\Delta x_{k}+\frac{1}{2}\Delta x_{k}^{T}B_{k+1}\Delta x_{k}, (3.2)

where \Delta x_{k}=x_{k}-x_{k+1}.

Lemma 3.

Suppose that the sequence is generated by the algorithm DMBFGS3 and that G is continuous at x^{*}. Then

\lim_{k\to\infty}\|Y_{k}(3)\|=0. (3.3)
Proof.

We derive the following equations through Taylor expansions:

y_{k}^{T}s_{k}=\left(g_{k+1}-g_{k}\right)^{T}s_{k}=s_{k}^{T}G(\xi_{1k})s_{k},

and

f(x_{k})-f(x_{k+1})=-g_{k+1}^{T}s_{k}+\frac{1}{2}s_{k}^{T}G(\xi_{2k})s_{k},

where \xi_{1k}=x_{k}+\theta_{1k}(x_{k+1}-x_{k}), \xi_{2k}=x_{k}+\theta_{2k}(x_{k+1}-x_{k}), and \theta_{1k},\theta_{2k}\in(0,1). From the definition of Y_{k}(3) and Lemma 2, we get

Y_{k}(3)=\frac{s_{k}^{T}B_{k+1}s_{k}-s_{k}^{T}G(\xi_{1k})s_{k}}{\|s_{k}\|^{2}}I,

and

s_{k}^{T}B_{k+1}s_{k}=s_{k}^{T}G(\xi_{2k})s_{k}.

Hence,

\|Y_{k}(3)\|\leq\|G(\xi_{2k})-G(\xi_{1k})\|.

Therefore, Lemma 3 holds. In particular, Y_{k}(3) is bounded, and we denote a bound of this sequence by \|Y_{k}(3)\|\leq N_{0}. ∎

Lemma 4.

Let \{x_{k}\} be produced by the DMBFGS3 algorithm. Then there exist a sequence of positive numbers \{b_{i}\}_{i=1}^{n} and a sufficiently large positive number W such that, for all iterations k>W,

\|B_{k+1}-G(x^{*})\| \leq\|B_{k}-G(x^{*})\|+\frac{2n}{\varepsilon_{1}}(c\sqrt{n}+1)\sigma_{k}+\sqrt{n}+\frac{n^{2}N_{0}}{\varepsilon_{1}^{3}} (3.4)
=\|B_{k}-G(x^{*})\|+\omega_{1}\sigma_{k}+\sqrt{n}+\frac{n^{2}N_{0}}{\varepsilon_{1}^{3}}.
Proof.

From Equation (1.6), for a given k, we have B_{k+1}=B_{k}+\Delta_{k}. Hence,

\left\|B_{k+1}-\nabla^{2}f(x^{*})\right\| =\left\|B_{k}+\Delta_{k}-\nabla^{2}f(x^{*})\right\| (3.5)
\leq\left\|B_{k}-\nabla^{2}f(x^{*})\right\|+\|\Delta_{k}\|.

Since y_{k}^{*}=B_{k+1}s_{k}+Y_{k}(3)s_{k}, we have

\left\|\Delta_{k}\right\| =\left\|\frac{s_{k}^{T}y_{k}^{*}-s_{k}^{T}B_{k}s_{k}+s_{k}^{T}s_{k}}{\mathrm{tr}\left(A_{k}^{2}\right)}A_{k}-I\right\| (3.6)
\leq\left\|\frac{s_{k}^{T}B_{k+1}s_{k}-s_{k}^{T}B_{k}s_{k}}{\mathrm{tr}\left(A_{k}^{2}\right)}A_{k}\right\|+\left\|\frac{s_{k}^{T}s_{k}}{\mathrm{tr}\left(A_{k}^{2}\right)}A_{k}-I\right\|+\left\|\frac{s_{k}^{T}Y_{k}\left(3\right)}{\mathrm{tr}\left(A_{k}^{2}\right)}\right\|.

Since \|A_{k}\|=\sqrt{tr\left(A_{k}^{T}A_{k}\right)}=\sqrt{tr\left(A_{k}^{2}\right)}, we have:

\left\|\frac{s_{k}^{T}B_{k+1}s_{k}-s_{k}^{T}B_{k}s_{k}}{\mathrm{tr}\left(A_{k}^{2}\right)}A_{k}\right\|=\frac{\left\|s_{k}^{T}(B_{k+1}-B_{k})s_{k}\right\|\left\|A_{k}\right\|}{\mathrm{tr}\left(A_{k}^{2}\right)} (3.7)
=\frac{\left\|s_{k}^{T}(B_{k+1}-B_{k})s_{k}\right\|}{\sqrt{\mathrm{tr}\left(A_{k}^{2}\right)}}\leq\frac{\left\|B_{k+1}-B_{k}\right\|\left\|s_{k}\right\|^{2}}{\sqrt{\mathrm{tr}\left(A_{k}^{2}\right)}}.

Since \|s_{k}\|\neq 0, there exist \alpha>0 and \varepsilon_{1}>0 such that \alpha\geq s_{k}^{\mathrm{max}} and \|s_{k}\|\geq\varepsilon_{1}. Therefore, n\alpha^{2}\geq n(s_{k}^{\mathrm{max}})^{2}\geq\|s_{k}\|^{2}\geq\varepsilon_{1}^{2}. Hence,

\alpha^{2}\geq\left(s_{k}^{\max}\right)^{2}\geq\frac{\varepsilon_{1}^{2}}{n}. (3.8)

Moreover,

tr\left(A_{k}^{2}\right)=\sum_{i=1}^{n}\left(s_{k}^{i}\right)^{4}\geq\left(s_{k}^{\max}\right)^{4}=\left(s_{k}^{\max}\right)^{2}\left(s_{k}^{\max}\right)^{2}n\frac{1}{n}\geq\frac{\varepsilon_{1}^{2}}{n}\left\|s_{k}\right\|^{2}\frac{1}{n}.

Hence

tr\left(A_{k}^{2}\right)\geq\frac{\varepsilon_{1}^{2}}{n^{2}}\left\|s_{k}\right\|^{2},

and therefore

\frac{\|s_{k}\|}{\sqrt{tr\left(A_{k}^{2}\right)}}\leq\frac{n}{\varepsilon_{1}}. (3.9)

But, from Lemma 1,

\|s_{k}\|=\|x_{k+1}-x_{k}\|=\|x_{k+1}-x^{*}+x^{*}-x_{k}\|\leq\|x_{k+1}-x^{*}\|+\|x^{*}-x_{k}\|\leq 2\sigma_{k}.

Using (3.7) and Lemma 1, we deduce that:

\frac{\left\|B_{k+1}-B_{k}\right\|\left\|s_{k}\right\|^{2}}{\sqrt{\mathrm{tr}\left(A_{k}^{2}\right)}}=\frac{\left\|\Delta_{k}\right\|\left\|s_{k}\right\|\left\|s_{k}\right\|}{\sqrt{\mathrm{tr}\left(A_{k}^{2}\right)}}\leq c\sqrt{n}\frac{n}{\varepsilon_{1}}2\sigma_{k}. (3.10)

Examining Equation 3.6, we find:

\left\|\frac{s_{k}^{T}s_{k}}{\operatorname{tr}\left(A_{k}^{2}\right)}A_{k}-I\right\| =\left\|\frac{\left(s_{k}^{T}s_{k}\right)A_{k}-\operatorname{tr}\left(A_{k}^{2}\right)I}{\operatorname{tr}\left(A_{k}^{2}\right)}\right\| (3.11)
=\frac{\left\|\left(s_{k}^{T}s_{k}\right)A_{k}-\operatorname{tr}\left(A_{k}^{2}\right)I\right\|}{\operatorname{tr}\left(A_{k}^{2}\right)}\leq\frac{\left\|\left(s_{k}^{T}s_{k}\right)A_{k}+\operatorname{tr}\left(A_{k}^{2}\right)I\right\|}{\operatorname{tr}\left(A_{k}^{2}\right)}
\leq\frac{\|s_{k}\|^{2}\|A_{k}\|+\operatorname{tr}\left(A_{k}^{2}\right)\|I\|}{\operatorname{tr}\left(A_{k}^{2}\right)}=\frac{\|s_{k}\|^{2}\sqrt{\operatorname{tr}\left(A_{k}^{2}\right)}+\operatorname{tr}\left(A_{k}^{2}\right)\sqrt{n}}{\operatorname{tr}\left(A_{k}^{2}\right)}
=\frac{\|s_{k}\|^{2}}{\sqrt{\operatorname{tr}\left(A_{k}^{2}\right)}}+\sqrt{n}\leq\frac{\|s_{k}\|^{2}}{\frac{\varepsilon_{1}}{n}\|s_{k}\|}+\sqrt{n}=\frac{n}{\varepsilon_{1}}\|s_{k}\|+\sqrt{n}\leq\frac{n}{\varepsilon_{1}}2\sigma_{k}+\sqrt{n}.

Also, considering the terms in (3.8) and (3.9), we get:

\left\|\frac{s_{k}^{T}Y_{k}\left(3\right)}{\mathrm{tr}\left(A_{k}^{2}\right)}\right\|\leqslant\frac{\left\|s_{k}\right\|\left\|Y_{k}\left(3\right)\right\|}{\frac{\varepsilon_{1}^{2}}{n^{2}}\left\|s_{k}\right\|^{2}}=\frac{\left\|Y_{k}\left(3\right)\right\|}{\frac{\varepsilon_{1}^{2}}{n^{2}}\left\|s_{k}\right\|}\leqslant\frac{n^{2}N_{0}}{\varepsilon_{1}^{3}}. (3.12)

From (3.5), using (3.6), (3.10), (3.11), and (3.12), we have:

\left\|B_{k+1}-\nabla^{2}f(x^{*})\right\| \leq\left\|B_{k}-\nabla^{2}f(x^{*})\right\|+\frac{2n}{\varepsilon_{1}}(c\sqrt{n}+1)\sigma_{k}+\sqrt{n}+\frac{n^{2}N_{0}}{\varepsilon_{1}^{3}}
=\left\|B_{k}-\nabla^{2}f(x^{*})\right\|+\omega_{1}\sigma_{k}+\sqrt{n}+\frac{n^{2}N_{0}}{\varepsilon_{1}^{3}},

where \omega_{1}=2n(c\sqrt{n}+1)/\varepsilon_{1}>0, i.e., the update (2.10) fulfills the bounded deterioration property. ∎

In particular, the sequence \{\|B_{k}\|\} is bounded. The following theorem establishes the q-linear convergence of the DMBFGS3 technique. The proof relies on the bounded deterioration property established in Lemma 4 (see [27] for a comprehensive treatment).

Theorem 1.

Assume that the update formula (2.10) satisfies the bounded deterioration property, and that there exist positive constants \varepsilon and \delta such that

\|x_{0}-x^{*}\|<\varepsilon,\qquad\|B_{0}-G(x^{*})\|<\delta.

Then the sequence \{x_{k}\} produced by the DMBFGS3 method is well defined, and \{x_{k}\} converges q-linearly to x^{*}.

Theorem 2.

Suppose \{x_{k}\} and \{p_{k}\} are produced by Algorithm 2. If x^{*}\in\Omega, then under the stated assumptions,

\lim_{k\to\infty}\|p_{k}-x^{*}\|=0.
Proof.
\left\|p_{k}-x^{*}\right\|^{2} =\left\|x_{k}-x^{*}+\tau_{k}(x_{k}-x_{k-1})\right\|^{2} (3.13)
\leq\left\|x_{k}-x^{*}\right\|^{2}+2\tau_{k}(x_{k}-x_{k-1})^{T}\left(x_{k}+\tau_{k}(x_{k}-x_{k-1})-x^{*}\right)
\leq\left\|x_{k}-x^{*}\right\|^{2}+2\tau_{k}\|x_{k}-x_{k-1}\|\left(\left\|x_{k}-x^{*}\right\|+\tau_{k}\|x_{k}-x_{k-1}\|\right)
\leq\left\|x_{k}-x^{*}\right\|^{2}+2M_{0}\tau_{k}\|x_{k}-x_{k-1}\|+4M_{0}\tau_{k}\|x_{k}-x_{k-1}\|
=\left\|x_{k}-x^{*}\right\|^{2}+6M_{0}\tau_{k}\|x_{k}-x_{k-1}\|.

Since the sequence \{x_{k}\} is convergent, there exists M_{0} such that \|x_{k}-x^{*}\|\leq M_{0}; consequently \|x_{k}-x_{k-1}\|\leq 2M_{0}. By Theorem 1 we obtain \sum_{k=1}^{\infty}\tau_{k}\|x_{k}-x^{*}\|<\infty and \sum_{k=1}^{\infty}\tau_{k}\|x_{k}-x_{k-1}\|<\infty; since the convergence is q-linear, \sum_{k=1}^{\infty}\|x_{k}-x^{*}\|^{2}<\infty as well. Then from (3.13) we obtain:

\sum_{k=1}^{\infty}\left\|p_{k}-x^{*}\right\|^{2}<\infty, (3.14)

So that

\lim_{k\to\infty}\|p_{k}-x^{*}\|=0. (3.15)

The convergence of the algorithm DMBFGS3 is thus ensured, the sequence \{x_{k}\} converges q-linearly to x^{*}, and the sequence \{p_{k}\} generated by WDMBFGS3 converges as well. ∎

4 Data Analysis

All code was written and executed on a personal computer equipped with a 13th Gen Intel Core(TM) i5-13490F processor (2.50 GHz) and 32.0 GB of RAM (31.8 GB usable). MATLAB R2021b was used for coding and execution.

To support the theoretical results, this paper compares two classic quasi-Newton optimization algorithms: DNRTR [9] and MBFGS3 [1]. MBFGS3 is an enhanced quasi-Newton algorithm with a modified term added to y_{k}, while DNRTR is a diagonal quasi-Newton update method for unconstrained optimization.

4.1 CS problems

Candes et al. [28] and Donoho [29] have highlighted significant advancements in the field of Compressive Sensing (CS) in recent years. CS is a theoretical framework for efficiently acquiring and reconstructing sparse or approximately sparse signals. In this theory, sparsity means that a signal can be described using a relatively small number of non-zero components, such as the coefficient vector in an appropriate basis or transform. In essence, compressive sensing represents a large, sparse signal x with a limited number of linear measurements using a matrix A, storing y=Ax. The primary difficulty lies in reconstructing the original signal x from the observation vector y.

We applied the proposed method to the CS problem, a swiftly expanding discipline that has attracted significant interest across multiple scientific areas. We assess the quality of the reconstructed signal x^{*} by calculating its relative error with respect to the original signal x_{s} as follows:

\text{RelErr}:=100\frac{\|x^{*}-x_{s}\|}{\|x_{s}\|}.

Following the method described in [30], we created random test problems by setting the signal dimension to n=2^{12} and using the Hadamard matrix for the sampling matrix \psi. We performed a comparison among WDMBFGS3, MBFGS3, DMBFGS3, and DNRTR using a Bernoulli matrix A, with entries randomly taking values of +1 or -1 with equal likelihood. This setup allows us to evaluate the performance of these algorithms under the specified conditions and matrix structure. The initial point was set to 0\in\mathbb{R}^{n}. To perform a meaningful comparison of these approaches, the resulting data were analyzed in terms of the relative error (RelErr).
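The test setup described above can be reproduced along the following lines. This is a sketch only: the number of measurements m and the sparsity level k_sparse are illustrative values of our own and are not the settings used for the reported experiments.

import numpy as np

def cs_test_problem(n=2**12, m=2**10, k_sparse=2**5, seed=0):
    # Sparse signal, +/-1 Bernoulli sensing matrix A, and measurements y = A x.
    rng = np.random.default_rng(seed)
    x_s = np.zeros(n)
    support = rng.choice(n, size=k_sparse, replace=False)
    x_s[support] = rng.standard_normal(k_sparse)
    A = rng.choice([-1.0, 1.0], size=(m, n))       # entries +1 or -1 with equal probability
    y = A @ x_s
    return A, y, x_s

def rel_err(x_rec, x_s):
    # Relative error (in percent) between recovered and original signals, cf. RelErr above.
    return 100.0 * np.linalg.norm(x_rec - x_s) / np.linalg.norm(x_s)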

Figures 3 to 6 compare DNRTR, MBFGS3, DMBFGS3, and WDMBFGS3 in terms of reconstruction ability; each figure shows the recovered signal (red circles) against the original signal (blue peaks). These figures confirm the competitiveness of WDMBFGS3 relative to MBFGS3, DMBFGS3, and DNRTR. Furthermore, examining the plots and the actual reconstruction errors makes the effectiveness and robustness of the WDMBFGS3 method in recovering large sparse signals evident compared with the alternative solvers. The problem is smooth in this formulation and can be handled efficiently by the proposed algorithm.

Figure 1: Diagram of the original signal.
Figure 2: Diagram of the observation (noisy measurement).
Figure 3: Diagram of the recovered signal by DNRTR.
Figure 4: Diagram of the recovered signal by MBFGS3.
Figure 5: Diagram of the recovered signal by DMBFGS3.
Figure 6: Diagram of the recovered signal by WDMBFGS3.

4.2 74 Unconstrained Optimization Test Problems

To compare the numerical performance of the WDMBFGS3 and DMBFGS3 methods, we evaluated them with several measures: the CPU time (CPU), the total number of function and gradient evaluations (NFG), and the iteration count (NI). Additionally, the numerical experiments included two control groups: DNRTR and MBFGS3. The experimental settings are as follows:
Problem dimensions: 900, 1500, and 2100.
Parameters used: all algorithm parameters are specified in brackets [\sigma=0.8, \rho=0.0001].
Stop criterion: for all algorithms, if the condition \|g_{k}\|\leq\epsilon=10^{-6} is met, or if the number of iterations exceeds 5000, the algorithm halts.
If this condition cannot be met, the algorithm is also terminated once \|g(x)\|<e_{1} is satisfied, where e_{1}=10^{-5}.
Tested problems: this study assessed 74 unconstrained optimization problems, an extensive enumeration of which is provided in the work of Andrei [27].

The figures illustrate the comparison in two respects: the left side of each plot (\tau=1) displays the percentage of problems on which an algorithm is the most efficient, while the right side (\tau=\infty) shows the percentage of problems that the algorithm can solve. For a given algorithm, the \tau=1 end thus indicates its efficiency, while the \tau=\infty end represents its robustness.

To evaluate and compare the efficiency of these algorithms, we follow the performance-profile methodology proposed by Dolan and Moré [17]. The horizontal axis shows the performance ratio \tau of a method (in NI, NFG, or CPU time) relative to the best performance among all methods, and the vertical axis shows the proportion of problems for which the method's ratio is at most \tau. From Figures 7, 8, and 9 it can be seen that the performance of the DMBFGS3 algorithm is, in a sense, the best, because 59% of the problems are solved by this algorithm with the fewest iterations (NI), outperforming MBFGS3 and DNRTR, which solve at most 22% of the test problems with the fewest iterations. At the same time, Figure 7 shows that WDMBFGS3 solves 55% of the problems with the fewest iterations, which is not the best result; however, when the horizontal axis is expanded to \tau=3, WDMBFGS3 solves nearly 100% of the problems, indicating stronger robustness. It can also be seen clearly that WDMBFGS3, which uses the extrapolation technique, consistently maintains the highest number of wins (i.e., the highest probability of being the superior solver). Figure 8 shows that WDMBFGS3 solves 68% of the test problems with the fewest total function and gradient evaluations (NFG). Notably, when the horizontal axis is expanded to \tau=6, DMBFGS3 and WDMBFGS3 both perform excellently, solving over 90% of the problems. Although the performance of DMBFGS3 is close to that of WDMBFGS3 for \tau\geq 6, WDMBFGS3 solves these problems in fewer iterations (NI), so the algorithm with the extrapolation technique, WDMBFGS3, retains a stronger advantage and competitiveness. Figure 9 reports the CPU time used by the algorithms; clearly, the WDMBFGS3 and DMBFGS3 algorithms significantly outperform the other two.
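The performance profiles in Figures 7-9 can be computed from the raw results as sketched below, where results is an (n_problems, n_solvers) array of NI, NFG, or CPU values and np.inf marks a failure; the function itself is our own helper, not part of the compared solvers.

import numpy as np

def performance_profile(results, taus):
    # Dolan-More profile: fraction of problems each solver handles within a factor tau of the best.
    best = results.min(axis=1, keepdims=True)      # best cost per problem
    ratios = results / best                        # performance ratios
    return np.array([[np.mean(ratios[:, s] <= t) for s in range(results.shape[1])]
                     for t in taus])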

Figure 7: The Iteration Count (NI) for Four Algorithms.
Figure 8: The Total Number of Function and Gradient Evaluations (NFG) for Four Algorithms.
Figure 9: The CPU Time (CPU) for Four Algorithms.
Table 1: WDMBFGS3 versus DMBFGS3 versus MBFGS3 versus DNRTR
          DNRTR              MBFGS3             DMBFGS3            WDMBFGS3
n         #iter   cpu        #iter   cpu        #iter   cpu        #iter   cpu
900       34      0.046875   35      0.65625    41      0.03125    33      0.0625
1500      27      0.34375    35      2.078125   35      0.046875   42      0.0625
2100      28      2.25       35      12.54688   113     1.234375   51      0.09375
Total     89      2.640625   105     15.28125   189     1.3125     126     0.21875

Table 1 presents the results for problems whose number of variables lies in the range 900 to 2100. Evidently, for large-scale unconstrained optimization problems, the WDMBFGS3 algorithm performs notably better than the other algorithms, particularly in CPU time.

4.3 Muskingum Model

The nonlinear Muskingum model, frequently applied in hydrological engineering, poses a challenging test for optimization algorithms. A notable study by Ouyang et al. [31] presents this model, which is formulated as follows:

\min f(x_{1},x_{2},x_{3})=\sum_{i=1}^{n-1}{\left(\begin{array}[]{c}\left(1-\frac{\Delta t}{6}\right)x_{1}\left(x_{2}I_{i+1}+(1-x_{2})Q_{i+1}\right)^{x_{3}}\\ -\left(1-\frac{\Delta t}{6}\right)x_{1}\left(x_{2}I_{i}+(1-x_{2})Q_{i}\right)^{x_{3}}\\ -\frac{\Delta t}{2}(I_{i}-Q_{i})+\frac{\Delta t}{2}\left(1-\frac{\Delta t}{3}\right)(I_{i+1}-Q_{i+1})\\ \end{array}\right)^{2}},

where n denotes the total number of time steps, x_{1} is the storage time constant, x_{2} is the weighting factor, x_{3} is an additional parameter, \Delta t represents the time step at time t_{i} (i=1,2,\ldots,n), and I_{i} and Q_{i} are the observed inflow and outflow discharges, respectively.
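For completeness, the Muskingum objective can be coded directly from the formula above (a sketch; I and Q are the arrays of observed inflows and outflows, and dt defaults to the 12-hour time step used below).

import numpy as np

def muskingum_objective(x, I, Q, dt=12.0):
    # Nonlinear Muskingum least-squares objective with parameters x = (x1, x2, x3).
    x1, x2, x3 = x
    r = ((1 - dt / 6) * x1 * (x2 * I[1:] + (1 - x2) * Q[1:])**x3
         - (1 - dt / 6) * x1 * (x2 * I[:-1] + (1 - x2) * Q[:-1])**x3
         - dt / 2 * (I[:-1] - Q[:-1])
         + dt / 2 * (1 - dt / 3) * (I[1:] - Q[1:]))
    return np.sum(r**2)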

The research data for this paper are taken from [32]. The time step \Delta t was set to 12 hours, and the initial parameter values were x=[0,1,1]^{T}. Comprehensive information on I_{i} and Q_{i} for 1961 is available in the work by Ouyang et al. [31]. Figure 10 displays both the recorded and computed outflows for the Muskingum model using WDMBFGS3 in 1961, while Figure 11 presents the same for DMBFGS3. It is evident that the outflow curve fitted by the WDMBFGS3 algorithm is noticeably better than that of DMBFGS3.

Figure 10: Fitting Curve of Outflow Utilizing WDMBFGS3 Algorithm in 1961.
Figure 11: Fitting Curve of Outflow Utilizing DMBFGS3 Algorithm in 1961.

5 Conclusion

This paper introduces an improved quasi-Newton algorithm, DMBFGS3, which incorporates diagonal transformations and an adjustment term on y_{k}, further enhancing the stability of the algorithm. By introducing an extrapolation technique, the iterative process of the algorithm is accelerated, resulting in the proposed WDMBFGS3 algorithm. We discuss the performance of these two proposed algorithms in three different non-convex optimization experimental settings. A series of numerical experiments demonstrate that DMBFGS3 exhibits high efficiency and robustness when applied to non-convex optimization problems. Future research will focus on the following aspects: 1. Designing adaptive step sizes to avoid manual adjustments while ensuring theoretical convergence; 2. Investigating faster iterative setting strategies to improve the efficiency of model training.

Declarations

Data availability Data available on request from the authors.

Acknowledgement

This work is supported by Guangxi Science and Technology base and Talent Project (Grant No. AD22080047) and the special foundation for Guangxi Ba Gui Scholars.

References

  • Wei et al. [2004] Wei, Z., Yu, G., Yuan, G., Lian, Z.: The superlinear convergence of a modified bfgs-type method for unconstrained optimization. Computational optimization and applications 29, 315–332 (2004)
  • Davidon [1991] Davidon, W.C.: Variable metric method for minimization. SIAM Journal on optimization 1(1), 1–17 (1991)
  • Li and Fukushima [2001] Li, D.-H., Fukushima, M.: A modified bfgs method and its global convergence in nonconvex minimization. Journal of Computational and Applied Mathematics 129(1-2), 15–35 (2001)
  • Li and Fukushima [1999] Li, D., Fukushima, M.: A globally and superlinearly convergent gauss–newton-based bfgs method for symmetric nonlinear equations. SIAM Journal on numerical Analysis 37(1), 152–172 (1999)
  • Li and Fukushima [2001] Li, D.-H., Fukushima, M.: On the global convergence of the bfgs method for nonconvex unconstrained optimization problems. SIAM Journal on Optimization 11(4), 1054–1064 (2001)
  • Dennis and Moré [1974] Dennis, J.E., Moré, J.J.: A characterization of superlinear convergence and its application to quasi-newton methods. Mathematics of computation 28(126), 549–560 (1974)
  • Griewank and Toint [1982] Griewank, A., Toint, P.L.: Local convergence analysis for partitioned quasi-newton updates. Numerische Mathematik 39(3), 429–448 (1982)
  • Pierre [1986] Pierre, D.A.: Optimization Theory with Applications. Courier Corporation, ??? (1986)
  • Andrei [2019] Andrei, N.: A diagonal quasi-newton updating method for unconstrained optimization. Numerical Algorithms 81(2), 575–590 (2019)
  • Dennis and Wolkowicz [1993] Dennis, J.E. Jr, Wolkowicz, H.: Sizing and least-change secant methods. SIAM Journal on Numerical Analysis 30(5), 1291–1314 (1993)
  • Wolfe [1969a] Wolfe, P.: Convergence conditions for ascent methods. SIAM review 11(2), 226–235 (1969)
  • Wolfe [1969b] Wolfe, P.: Convergence conditions for ascent methods. SIAM review 11(2), 226–235 (1969)
  • Chen et al. [2013] Chen, P., Huang, J., Zhang, X.: A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems 29(2), 025011 (2013)
  • Iiduka [2012] Iiduka, H.: Iterative algorithm for triple-hierarchical constrained nonconvex optimization problem and its application to network bandwidth allocation. SIAM Journal on Optimization 22(3), 862–878 (2012)
  • Jolaoso et al. [2021] Jolaoso, L., Alakoya, T., Taiwo, A., Mewomo, O.: Inertial extragradient method via viscosity approximation approach for solving equilibrium problem in hilbert space. Optimization 70(2), 387–412 (2021)
  • Abubakar et al. [2020a] Abubakar, J., Kumam, P., Hassan Ibrahim, A., Padcharoen, A.: Relaxed inertial tseng’s type method for solving the inclusion problem with application to image restoration. Mathematics 8(5), 818 (2020)
  • Abubakar et al. [2020b] Abubakar, J., Kumam, P., Hassan Ibrahim, A., Padcharoen, A.: Relaxed inertial tseng’s type method for solving the inclusion problem with application to image restoration. Mathematics 8(5), 818 (2020)
  • Abubakar et al. [2019] Abubakar, J., Sombut, K., Ibrahim, A.H., et al.: An accelerated subgradient extragradient algorithm for strongly pseudomonotone variational inequality problems. Thai Journal of Mathematics 18(1), 166–187 (2019)
  • Alvarez [2004] Alvarez, F.: Weak convergence of a relaxed and inertial hybrid projection-proximal point algorithm for maximal monotone operators in hilbert space. SIAM Journal on Optimization 14(3), 773–782 (2004)
  • Alvarez and Attouch [2001] Alvarez, F., Attouch, H.: An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set-Valued Analysis 9, 3–11 (2001)
  • Polyak [1964] Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. Ussr computational mathematics and mathematical physics 4(5), 1–17 (1964)
  • Alvarez [2000] Alvarez, F.: On the minimizing property of a second order dissipative system in hilbert spaces. SIAM Journal on Control and Optimization 38(4), 1102–1119 (2000)
  • Sun and Liu [2015] Sun, M., Liu, J.: A modified hestenes–stiefel projection method for constrained nonlinear equations and its linear convergence rate. Journal of Applied Mathematics and Computing 49, 145–156 (2015)
  • Ibrahim et al. [2022] Ibrahim, A.H., Kumam, P., Sun, M., Chaipunya, P., Abubakar, A.B.: Projection method with inertial step for nonlinear equations: application to signal recovery. Journal of Industrial and Management Optimization 19(1), 30–55 (2022)
  • Moudafi and Elizabeth [2003] Moudafi, A., Elizabeth, E.: Approximate inertial proximal methods using the enlargement of maximal monotone operators. International Journal of Pure and Applied Mathematics 5(3), 283–299 (2003)
  • Moudafi and Oliny [2003] Moudafi, A., Oliny, M.: Convergence of a splitting inertial proximal method for monotone operators. Journal of Computational and Applied Mathematics 155(2), 447–454 (2003)
  • Andrei [2008] Andrei, N.: An unconstrained optimization test functions collection. Adv. Model. Optim 10(1), 147–161 (2008)
  • Candès et al. [2006] Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory 52(2), 489–509 (2006)
  • Donoho [2006] Donoho, D.L.: Compressed sensing. IEEE Transactions on information theory 52(4), 1289–1306 (2006)
  • Aminifard et al. [2022] Aminifard, Z., Hosseini, A., Babaie-Kafaki, S.: Modified conjugate gradient method for solving sparse recovery problem with nonconvex penalty. Signal Processing 193, 108424 (2022)
  • Ouyang et al. [2015] Ouyang, A., Liu, L.-B., Sheng, Z., Wu, F.: A class of parameter estimation methods for nonlinear muskingum model using hybrid invasive weed optimization algorithm. Mathematical Problems in Engineering 2015(1), 573894 (2015)
  • Luo et al. [2022] Luo, D., Li, Y., Lu, J., Yuan, G.: A conjugate gradient algorithm based on double parameter scaled broyden–fletcher–goldfarb–shanno update for optimization problems and image restoration. Neural Computing and Applications 34(1), 535–553 (2022)