
Gradient and Hessian approximations in Derivative Free Optimization

Ian D. Coope and Rachael Tappenden
Abstract

This work investigates finite differences and the use of interpolation models to obtain approximations to the first and second derivatives of a function. Here, it is shown that if a particular set of points is used in the interpolation model, then the solution to the associated linear system (i.e., approximations to the gradient and diagonal of the Hessian) can be obtained in \mathcal{O}(n) computations, which is the same cost as finite differences, and is a saving over the \mathcal{O}(n^{3}) cost when solving a general unstructured linear system. Moreover, if the interpolation points are formed using a ‘regular minimal positive basis’, then the error bound for the gradient approximation is the same as for a finite differences approximation. Numerical experiments are presented that show how the derivative estimates can be employed within an existing derivative free optimization algorithm, thus demonstrating one of the potential practical uses of these derivative approximations.

Keywords. Derivative Free Optimization; Positive Bases; Finite difference approximations; Interpolation models; Taylor series; Conjugate Gradients; Simplices; Simplex Gradients; Preconditioning; Frames.

1 Introduction

This work considers unconstrained optimization problems of the form

\min_{x\in\mathbf{R}^{n}}f(x), \qquad (1)

where f:\mathbf{R}^{n}\to\mathbf{R} is smooth, but the derivatives of f are unavailable. This is a typical derivative free optimization setting. Problems of this form arise, for example, when the function is smooth but it is computationally impractical to obtain derivative information, or when the function itself is not known explicitly although function values are available from a black-box or oracle (for example, via physical measurements).

A key ingredient of many algorithms for solving (1) in a derivative free setting is an estimate of the gradient of f. Sometimes gradients are used to determine search directions (for example, in implicit filtering algorithms [5, 18, 19]), and sometimes they are used in the definition of an algorithm’s stopping criteria. Several techniques exist for determining estimates to derivatives, including automatic differentiation [21, 3, 25], finite difference approximations, interpolation models [8, 11], Gaussian smoothing [24, 29], and smoothing on the unit sphere [17]. This work investigates finite differences and interpolation models, so henceforth, we focus on these approaches.

A well known, simple, and widely applicable approach to estimating derivatives is finite difference approximations (or finite differences). Finite differences are so called because they provide estimates to derivatives by taking the difference between function values at certain known points. For example, for a function f:\mathbf{R}^{n}\to\mathbf{R}, the forward differences approximation g(x) to the (first) derivative is

[g(x)]_{i}=\frac{f(x+he_{i})-f(x)}{h},\qquad\text{for}\;i=1,\dots,n, \qquad (2)

which is an \mathcal{O}(h) accurate approximation. Many other (well known) formulations exist, including approximations that provide higher accuracy estimates of the gradient, as well as formulations for higher order derivatives. The standard finite difference formulations can be used whenever function values evaluated along coordinate directions are available. The formulations involve minimal computations (\mathcal{O}(n) operations), and they have low memory requirements because one does not need to store the coordinate directions, but finite differences can suffer from cancellation errors.
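To make (2) concrete, here is a minimal sketch (ours, not from the paper, assuming NumPy and a black-box callable f; the helper name forward_difference_gradient is hypothetical) of the forward differences gradient estimate. It uses n extra function evaluations and \mathcal{O}(n) arithmetic.

import numpy as np

def forward_difference_gradient(f, x, h=1e-6):
    # Sketch of (2): [g(x)]_i = (f(x + h e_i) - f(x)) / h, an O(h) accurate estimate.
    n = x.size
    fx = f(x)
    g = np.zeros(n)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        g[i] = (f(x + h * e_i) - fx) / h
    return g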

Another approach to approximating derivatives is to use interpolation models [9, 33, 20]. Interpolation models can be viewed as a generalization of finite differences, where, instead of being restricted to the coordinate directions, function evaluations along a general set of directions are utilized, and derivative approximations are found by solving an associated system of equations (i.e., minimizing the interpolation model). Linear models lead to approximations to the gradient [4], while quadratic models provide approximations to the gradient and Hessian [8]. Much research has focused on the ‘quality’ of the set of directions, because this impacts the accuracy of the derivative estimates [2, 6, 7, 16].

The purpose of this work is to show that, if one uses a (diagonal) quadratic interpolation model with carefully chosen interpolation directions and 2n+2 function values, then approximations to the gradient and diagonal of the Hessian can be obtained in \mathcal{O}(n) computations and with only \mathcal{O}(1) vectors of storage. The solution to this specific instance of an interpolation model can be thought of, in some sense, as a general finite difference type approximation to the gradient and diagonal of the Hessian, because the solution to the interpolation model is a formula that involves sums/differences of function values at the interpolation points. Although the focus of this work is on derivative estimates and not algorithmic development, the estimates here are widely applicable, and could be employed by many algorithms that require gradient approximations. Moreover, the diagonal Hessian approximation is readily available, and could be used as a preconditioner within an algorithm. (See the numerical experiments in Section 6, which include a derivative free preconditioned conjugate gradients algorithm.)

It remains to note that the approaches mentioned above all require access to function values, and the cost of evaluating a function in a derivative free optimization setting can vary greatly depending upon the particular application being considered. It is up to the user to determine their function evaluation budget, and the approaches in this work require between n+1 and 2n+2 function evaluations to generate derivative approximations.

1.1 Summary of Contributions

The contributions of this work are listed now.

  1. Derivative estimates. This work builds a quadratic interpolation model (with a diagonal approximation to the Hessian), and shows that if a certain set of interpolation directions is used in the quadratic model, with 2n+2 associated function evaluations, then approximations to the gradient g(x) and diagonal of the Hessian D(x) can be found in \mathcal{O}(n) operations. The set of interpolation directions is referred to as a regular minimal positive basis, and these directions need not be stored in full, but can be computed on-the-fly from \mathcal{O}(1) stored vectors.

  2. An error bound. We confirm that the gradient approximations developed in this work are \mathcal{O}(h^{2}) accurate, where h is the ‘sampling radius’. (See Section 4.5.)

  3. Application. We provide an example of how the gradient estimates developed in this work could be used in practice. In particular, we take the frame based, preconditioned conjugate gradients algorithm developed in [12], which requires estimates of the gradient and a diagonal preconditioner, and employ the estimates g(x) and D(x) developed in this work. (The original paper [12] employed the standard central difference approximations to the gradient and diagonal of the Hessian.)

1.2 Notation

Let e\in\mathbf{R}^{N} denote the vector of all ones and let e_{j}\in\mathbf{R}^{N} for j=1,\dots,N denote the standard unit basis vectors in \mathbf{R}^{N}, i.e., e_{j} is the jth column of the N\times N identity matrix I. (Usually we will take N=n or N=n+1 in this work.)

Consider the points x_{1},\dots,x_{n+1}\in\mathbf{R}^{n}. Define f_{j}:=f(x_{j}) for j=1,\dots,n+1, i.e., f_{j} denotes the function value at the point x_{j}. It is convenient to define the following vectors, which contain the function values:

\mathbf{f}_{+}=\begin{bmatrix}\mathbf{f}\\ f_{n+1}\end{bmatrix}\in\mathbf{R}^{n+1}\quad\text{where}\quad\mathbf{f}:=\begin{bmatrix}f_{1}\\ \vdots\\ f_{n}\end{bmatrix}\in\mathbf{R}^{n}. \qquad (3)

Similarly, for the points x_{1}^{\prime},\dots,x_{n+1}^{\prime}\in\mathbf{R}^{n}, define f_{j}^{\prime}:=f(x_{j}^{\prime}) for all j (i.e., f_{j}^{\prime} denotes the function value at the point x_{j}^{\prime}), and

\mathbf{f}_{+}^{\prime}=\begin{bmatrix}\mathbf{f}^{\prime}\\ f_{n+1}^{\prime}\end{bmatrix}\in\mathbf{R}^{n+1}\quad\text{where}\quad\mathbf{f}^{\prime}:=\begin{bmatrix}f_{1}^{\prime}\\ \vdots\\ f_{n}^{\prime}\end{bmatrix}\in\mathbf{R}^{n}. \qquad (4)

It will also be convenient to define the vectors

\delta\mathbf{f}:=\mathbf{f}-f(x)e\in\mathbf{R}^{n},\qquad\delta\mathbf{f}^{\prime}:=\mathbf{f}^{\prime}-f(x)e\in\mathbf{R}^{n} \qquad (5)

and

\delta\mathbf{f}_{+}:=\mathbf{f}_{+}-f(x)e\in\mathbf{R}^{n+1},\qquad\delta\mathbf{f}_{+}^{\prime}:=\mathbf{f}_{+}^{\prime}-f(x)e\in\mathbf{R}^{n+1}. \qquad (6)

Usually, vectors are assumed to lie in \mathbf{R}^{n}, while the subscript ‘+’ denotes that the vector has an additional entry, so is an element of \mathbf{R}^{n+1} (recall the notation above). Matrices are assumed to be elements of \mathbf{R}^{n\times n}, while the subscript ‘+’ denotes an additional column, so that the matrix is an element of \mathbf{R}^{n\times(n+1)}.

The symbol ‘\odot’ denotes the Hadamard product, i.e., given x,y\in\mathbf{R}^{n}, x\odot y\in\mathbf{R}^{n} where (x\odot y)_{i}:=x_{i}y_{i}.

1.3 Paper Outline

This paper is organised as follows. Section 2 introduces the (diagonal) quadratic interpolation model employed in this work, and shows how derivative approximations can be found using the model. Section 3 describes how approximations to the gradient g(x) and the diagonal of the Hessian D(x) can be constructed from the model in \mathcal{O}(n) computations when the interpolation directions are chosen to be the coordinate basis, and also when they are chosen to be a ‘regular basis’ (see Section 3.2). Many algorithms in derivative free optimization are based upon simplices, which can be generated from a minimal positive basis. Thus, in Section 4, approximations to the gradient g(x) and the diagonal of the Hessian D(x) that can be computed in \mathcal{O}(n) are presented when the interpolation points are formed from either a coordinate minimal positive basis, or a regular minimal positive basis (see Section 4.2). It will also be shown that the regular basis and the regular minimal positive basis are closely related to a regular simplex, hence the terminology ‘regular’ (see Section 4.1). In Section 4.5 an error bound is presented to confirm that the gradient approximations developed in this work are \mathcal{O}(h^{2}) accurate, where |h| is the sampling radius. Section 5 summarizes existing work on linear models, while numerical experiments are presented in Section 6 to demonstrate the accuracy and applicability of these results.

2 Derivative Estimation via Interpolation Models

This section describes how to approximate derivatives of a function via interpolation models. The Taylor series for a function f about a point x\in\mathbf{R}^{n} is

f(y)=f(x)+(y-x)^{T}\nabla f(x)+\tfrac{1}{2}(y-x)^{T}\nabla^{2}f(x)(y-x)+\mathcal{O}(\|y-x\|_{2}^{3}). \qquad (7)

Let g(x) denote an approximation to the gradient \nabla f(x) (i.e., g(x)\approx\nabla f(x)), and let H(x) denote an approximation to the Hessian \nabla^{2}f(x) (i.e., H(x)\approx\nabla^{2}f(x)). Then (7) becomes

f(y)\approx f(x)+(y-x)^{T}g(x)+\tfrac{1}{2}(y-x)^{T}H(x)(y-x). \qquad (8)

In this work, only a diagonal approximation to the Hessian is considered, so define

D(x):=\begin{bmatrix}d_{1}&&\\ &\ddots&\\ &&d_{n}\end{bmatrix}\in\mathbf{R}^{n\times n},\qquad d(x):=\begin{bmatrix}d_{1}\\ \vdots\\ d_{n}\end{bmatrix}\in\mathbf{R}^{n}, \qquad (9)

where D(x)\approx{\rm diag}(\nabla^{2}f(x)). Combining (8) and (9) leads to the (diagonal) quadratic model

m(y)=f(x)+(y-x)^{T}g(x)+\tfrac{1}{2}(y-x)^{T}D(x)(y-x). \qquad (10)

The motivation for this diagonal quadratic model is as follows. It is well known that, if one uses a linear model with n (or n+1) interpolation points (i.e., n or n+1 function evaluations), then an estimate of the gradient is available that is \mathcal{O}(h) accurate (see also Section 5, which summarises results for linear models). For some situations, an \mathcal{O}(h) accurate gradient is not of sufficient accuracy to make algorithmic progress, so an \mathcal{O}(h^{2}) accurate gradient is used instead. In this case, one can still use a linear model, but 2n (or 2n+2) interpolation points (i.e., 2n (or 2n+2) function evaluations) are needed. In derivative free optimization, function evaluations are often expensive to obtain, so it is prudent to ‘squeeze out’ as much information from them as possible. Hence, suppose one had available 2n (or 2n+2) function evaluations because an accurate gradient was desired. The model (10) contains 2n unknowns, so 2n interpolation points and function values are enough to uniquely determine an \mathcal{O}(h^{2}) accurate gradient g(x), as well as an approximation to the pure second derivatives on the diagonal of D(x), i.e., no additional function evaluations are required to compute the pure second derivatives. Although D(x) can be determined ‘for free’ in terms of function evaluations (assuming 2n are available for the gradient estimate), we will also show that the diagonal matrix D(x) can be computed in \mathcal{O}(n) flops, so the linear algebra costs are low. This is similar to the case for finite differences (i.e., one requires 2n(+1) function evaluations to compute g(x) via the central differences formula, and those same function values can also be used to compute the pure second derivatives), and so this diagonal quadratic model (10) is useful because one can obtain analogous results for a certain set of interpolation points.

The goal is to determine g(x) and D(x) from the interpolation model (10) using a set of known points and the corresponding function values. Hence, one must build a sample set (a set of interpolation points). In general, one can use any number of interpolation points, although this affects the model. For a linear model (omitting the third term in (10)), one can determine a unique solution g(x) if there are n interpolation/sample points with known function values at those points (see Section 5). For a general quadratic model with Hessian approximation H(x), one requires (n+1)(n+2)/2 function values to determine unique solutions g(x) and H(x). If one has access to more or fewer sample points/function values, then one can use, for example, a least squares approach to find approximations to g(x) and H(x). In the rest of this section, 2n interpolation points will be used, and it will become clear that this allows unique solutions g(x) and D(x) to the model (10).

Given x, a set of directions \{u_{1},\dots,u_{n}\}\subset\mathbf{R}^{n}, and a scalar h (sometimes referred to as the sampling radius), a set of points \{x_{1},\dots,x_{n}\} is defined via the equations

x_{j}=x+hu_{j},\quad\|u_{j}\|_{2}=\tfrac{1}{|h|}\|x_{j}-x\|_{2},\qquad j=1,\dots,n. \qquad (11)

The model is constructed to satisfy the interpolation conditions

f(x+hu_{j})=m(x+hu_{j}),\qquad j=1,\dots,n. \qquad (12)

Substituting y=x+hu_{j}=x_{j} \forall j into the model (10) gives the equations

m(x_{j})=f(x)+hu_{j}^{T}g(x)+\tfrac{1}{2}h^{2}u_{j}^{T}D(x)u_{j}. \qquad (13)

Because D(x) is diagonal, u_{j}^{T}D(x)u_{j}=\sum_{i=1}^{n}d_{i}(u_{j})_{i}^{2}=(u_{j}\odot u_{j})^{T}d(x) \forall j. Defining

U:=\begin{bmatrix}u_{1}&\dots&u_{n}\end{bmatrix} \qquad (14)
W:=\begin{bmatrix}u_{1}\odot u_{1}&u_{2}\odot u_{2}&\dots&u_{n}\odot u_{n}\end{bmatrix}, \qquad (15)

where U,W\in\mathbf{R}^{n\times n}, allows (13) to be written in matrix form as

\delta\mathbf{f}\overset{(5)}{=}hU^{T}g(x)+\tfrac{1}{2}h^{2}W^{T}d(x). \qquad (16)

The system (16) is underdetermined (there are only n equations in 2n unknowns), and as the goal is to find unique solutions g(x) and D(x)={\rm diag}(d(x)), an additional n equations are required. Define

h^{\prime}:=\eta h,\qquad\text{where }\eta\neq 1. \qquad (17)

Now, keep the point x and the set of directions \{u_{1},\dots,u_{n}\}\subset\mathbf{R}^{n} fixed and choose h^{\prime} satisfying (17). A new set of points \{x_{1}^{\prime},\dots,x_{n}^{\prime}\} is defined via the equations

x_{j}^{\prime}=x+h^{\prime}u_{j},\quad\|u_{j}\|_{2}=\tfrac{1}{|h^{\prime}|}\|x_{j}^{\prime}-x\|_{2},\qquad j=1,\dots,n. \qquad (18)

The definition of h^{\prime} in (17) ensures that x_{j}^{\prime}\neq x_{j} \forall j. The model is constructed to satisfy the interpolation conditions

f(x+\eta hu_{j})=m(x+\eta hu_{j}),\qquad j=1,\dots,n. \qquad (19)

(That is, f(x_{j})=m(x_{j}) and f(x_{j}^{\prime})=m(x_{j}^{\prime}) for j=1,\dots,n.) Following similar arguments to those previously established, one arrives at the system

\delta\mathbf{f}^{\prime}\overset{(5)}{=}h^{\prime}U^{T}g(x)+\tfrac{1}{2}(h^{\prime})^{2}W^{T}d(x)=\eta hU^{T}g(x)+\tfrac{1}{2}\eta^{2}h^{2}W^{T}d(x). \qquad (20)

The systems (16) and (20) can be combined into block matrix form as follows:

\begin{bmatrix}\delta\mathbf{f}\\ \delta\mathbf{f}^{\prime}\end{bmatrix}=\begin{bmatrix}hU^{T}&\tfrac{1}{2}h^{2}W^{T}\\ \eta hU^{T}&\tfrac{1}{2}\eta^{2}h^{2}W^{T}\end{bmatrix}\begin{bmatrix}g(x)\\ d(x)\end{bmatrix}. \qquad (21)

It is clear that (21) represents a system of 2n equations in 2n unknowns. Performing block row operations on the system (21) (i.e., multiplying the first row by \eta and subtracting the second row, and also multiplying the first row by \eta^{2} and subtracting the second row) gives

\begin{bmatrix}\eta\delta\mathbf{f}-\delta\mathbf{f}^{\prime}\\ \eta^{2}\delta\mathbf{f}-\delta\mathbf{f}^{\prime}\end{bmatrix}=\begin{bmatrix}0&\tfrac{1}{2}h^{2}(\eta-\eta^{2})W^{T}\\ (\eta^{2}-\eta)hU^{T}&0\end{bmatrix}\begin{bmatrix}g(x)\\ d(x)\end{bmatrix}. \qquad (22)

Hence, estimates of the gradient and diagonal of the Hessian can be found by solving

y=hU^{T}g(x),\qquad\text{and}\qquad z=\tfrac{1}{2}h^{2}W^{T}d(x), \qquad (23)

where

y:=\tfrac{1}{\eta(\eta-1)}(\eta^{2}\delta\mathbf{f}-\delta\mathbf{f}^{\prime}) \qquad (24)
z:=\tfrac{1}{\eta(1-\eta)}(\eta\delta\mathbf{f}-\delta\mathbf{f}^{\prime}). \qquad (25)

Questions naturally arise, such as ‘when do unique solutions to (23) exist?’ and ‘what is the cost of computing the solutions?’. Clearly, if the matrices U and W have full rank, then unique solutions to (23) exist. The rank of U and W, of course, depends upon how the interpolation points are chosen. If the interpolation directions \{u_{1},\dots,u_{n}\} form a basis for \mathbf{R}^{n}, then a unique solution g(x) exists. Similarly, if the vectors w_{j}=u_{j}\odot u_{j}, for j=1,\dots,n, form a basis for \mathbf{R}^{n}, then a unique solution d(x) exists.

Computing the solution to a general unstructured system has a cost of \mathcal{O}(n^{3}) flops. The focus of the remainder of this paper is to show that, if one selects the interpolation points/interpolation directions in a specific way, then unique solutions g(x) and d(x) to (23) exist, and moreover, the computational cost of finding the solutions remains low, at \mathcal{O}(n). This is the same cost as finite difference approximations to the gradient and pure second order derivatives. Importantly, it will also be shown that this specific set of directions does not increase the memory footprint either, requiring only \mathcal{O}(1) vectors of storage.
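To fix ideas, the following sketch (ours, assuming NumPy and a black-box callable f; the function name interpolation_g_and_d is hypothetical) builds y and z from (24)–(25) and solves the two n-by-n systems in (23) with dense solves. This is the \mathcal{O}(n^{3}) baseline that the structured bases in the later sections avoid.

import numpy as np

def interpolation_g_and_d(f, x, U, h, eta=-1.0):
    # Reference solver for (23)-(25): sample f at x + h u_j and x + eta*h u_j,
    # form y and z, then solve the two n-by-n systems by dense linear algebra.
    # This costs O(n^3); the structured bases below reduce it to O(n).
    n = x.size
    fx = f(x)
    df = np.array([f(x + h * U[:, j]) for j in range(n)]) - fx          # delta f
    dfp = np.array([f(x + eta * h * U[:, j]) for j in range(n)]) - fx   # delta f'
    y = (eta**2 * df - dfp) / (eta * (eta - 1.0))                        # (24)
    z = (eta * df - dfp) / (eta * (1.0 - eta))                           # (25)
    W = U * U                                                            # columns are u_j squared elementwise
    g = np.linalg.solve(h * U.T, y)                                      # y = h U^T g
    d = np.linalg.solve(0.5 * h**2 * W.T, z)                             # z = (h^2/2) W^T d
    return g, d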

In what follows, it will be useful to notice that, in the special case \eta=-1, (24) and (25) become

y=\tfrac{1}{2}(\delta\mathbf{f}-\delta\mathbf{f}^{\prime}),\qquad[y]_{i}=\tfrac{1}{2}(f(x+hu_{i})-f(x-hu_{i})) \qquad (26)
z=\tfrac{1}{2}(\delta\mathbf{f}+\delta\mathbf{f}^{\prime}),\qquad[z]_{i}=\tfrac{1}{2}(f(x+hu_{i})+f(x-hu_{i})-2f(x)). \qquad (27)

The Sherman-Morrison formula for a nonsingular matrix A\in\mathbf{R}^{n\times n} and vectors u,v\in\mathbf{R}^{n} is used many times in this work, and is stated now for convenience:

(A+uv^{T})^{-1}=A^{-1}-\frac{A^{-1}uv^{T}A^{-1}}{1+v^{T}A^{-1}u}. \qquad (28)
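As a quick numerical sanity check (ours, not part of the paper's development, assuming NumPy), the identity (28) can be verified on a random rank-one update:

import numpy as np

rng = np.random.default_rng(0)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))        # nonsingular with overwhelming probability
u = rng.standard_normal(n)
v = rng.standard_normal(n)
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))                  # left-hand side of (28)
rhs = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / (1.0 + v @ Ainv @ u)   # right-hand side (assumes 1 + v^T A^{-1} u != 0)
print(np.allclose(lhs, rhs))                             # expected output: True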

3 Computing derivative estimates via interpolation

This section describes how to compute solutions to the diagonal quadratic interpolation model when the coordinate basis is used to construct the interpolation points, and also when the regular basis is employed. In both cases the computational cost of determining the solution is \mathcal{O}(n), and when the coordinate basis is used with \eta=-1 in (17), the usual finite differences approximations to the first and second (pure) derivatives are recovered.

3.1 Interpolation using the coordinate basis

Here we present the solutions g(x) and d(x) to (23) when the coordinate basis \{e_{1},\dots,e_{n}\}\subset\mathbf{R}^{n} is used to generate the interpolation points. Given a point x\in\mathbf{R}^{n}, take u_{j}=e_{j} \forall j, so that U\equiv I. Because e_{j}\odot e_{j}=e_{j} \forall j, W\equiv I. Subsequently, the points x_{1},\dots,x_{n},x_{1}^{\prime},\dots,x_{n}^{\prime} satisfy x_{j}=x+he_{j}, x_{j}^{\prime}=x+\eta he_{j}, j=1,\dots,n. Substituting this into (23) shows that g(x)=\tfrac{1}{h}y and d(x)=\tfrac{2}{h^{2}}z. Now, set \eta=-1. Using (26) in the expression for g(x) gives

g(x)=\tfrac{1}{2h}(\delta\mathbf{f}-\delta\mathbf{f}^{\prime})=\tfrac{1}{2h}(\mathbf{f}-\mathbf{f}^{\prime}),

or equivalently, the ith element of g(x) is

[g(x)]_{i}=\tfrac{1}{2h}(f(x+he_{i})-f(x-he_{i})). \qquad (29)

Notice that (29) is the standard central differences approximation to the gradient, which is known to be \mathcal{O}(h^{2}) accurate (see also Section 4.5).

Similarly, substituting (27) in the expression for d(x) gives

d(x)=\tfrac{1}{h^{2}}\left(\delta\mathbf{f}+\delta\mathbf{f}^{\prime}\right)=\tfrac{1}{h^{2}}\left(\mathbf{f}+\mathbf{f}^{\prime}-2f(x)e\right),

or equivalently, the ith element of d(x) is

[d(x)]_{i}=\tfrac{1}{h^{2}}\left(f(x+he_{i})+f(x-he_{i})-2f(x)\right), \qquad (30)

which is the standard central differences approximation to the diagonal entries of the Hessian.

This confirms that when the coordinate basis is used in the diagonal quadratic interpolation model, with h^{\prime}=-h, then (23) recovers the central finite differences formulas for the gradient and pure second order derivatives.
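For reference, a minimal sketch (ours, assuming NumPy and a black-box callable f; the helper name is our own) of the recovered central differences formulas (29) and (30):

import numpy as np

def central_difference_g_and_d(f, x, h=1e-4):
    # Sketch of (29)-(30): coordinate basis with eta = -1 recovers the central
    # differences estimates of the gradient and of the diagonal of the Hessian.
    n = x.size
    fx = f(x)
    g = np.zeros(n)
    d = np.zeros(n)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        fp, fm = f(x + h * e_i), f(x - h * e_i)
        g[i] = (fp - fm) / (2.0 * h)            # (29)
        d[i] = (fp + fm - 2.0 * fx) / h**2      # (30)
    return g, d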

3.2 Interpolation using a regular basis

This section shows how gradient and diagonal Hessian estimates can be computed in \mathcal{O}(n) flops if a specific basis is used to generate the interpolation points. For ease of reference, we call this a regular basis. (It will be shown that the directions forming the regular basis are n of the internal arms of a regular simplex, see Section 4.)

Define the scalars

\alpha:=\sqrt{\frac{n+1}{n}},\qquad\gamma:=\frac{1}{n}\left(1-\sqrt{\frac{1}{n+1}}\right). \qquad (31)

Let \{v_{1},\dots,v_{n}\} be the regular basis for \mathbf{R}^{n}, whose elements are defined by

v_{j}:=\alpha(e_{j}-\gamma e),\qquad j=1,\dots,n, \qquad (32)

where the basis vectors can be collected as the columns of the matrix

V:=\begin{bmatrix}v_{1}&\dots&v_{n}\end{bmatrix}=\alpha(I-\gamma ee^{T})\in\mathbf{R}^{n\times n}. \qquad (33)

Equation (32) shows that the vectors v_{j} are simply a multiple of e (the vector of all ones) with an adjustment to the jth component. Thus, these vectors need not be stored explicitly; they can simply be generated on-the-fly and then discarded. Therefore, the regular basis has a low storage footprint of \mathcal{O}(1) vectors.
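A sketch (ours, assuming NumPy; the helper name and its 0-based index are our own conventions) of this on-the-fly generation: each v_{j} in (32) is produced from the scalars \alpha and \gamma alone, so no n\times n matrix is ever stored.

import numpy as np

def regular_basis_column(j, n):
    # Build v_j = alpha * (e_j - gamma * e) from (32) on the fly (0-based j),
    # so the regular basis is never stored as an n-by-n matrix.
    alpha = np.sqrt((n + 1) / n)
    gamma = (1.0 - 1.0 / np.sqrt(n + 1)) / n
    v = np.full(n, -alpha * gamma)
    v[j] += alpha
    return v

# Example: every column has unit 2-norm, cf. (34).
print(np.linalg.norm(regular_basis_column(0, 10)))   # approximately 1.0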

In [13, p.569] (see also page 20 and Corollary 2.6 in [10]) it is shown that for V defined in (33),

V^{2}=V^{T}V=\begin{bmatrix}1&-\frac{1}{n}&\dots&-\frac{1}{n}\\ -\frac{1}{n}&1&&\vdots\\ \vdots&&\ddots&-\frac{1}{n}\\ -\frac{1}{n}&\dots&-\frac{1}{n}&1\end{bmatrix},

which shows that the elements of the regular basis have unit length:

\|v_{j}\|_{2}=1,\qquad j=1,\dots,n. \qquad (34)

The following lemma confirms that \{v_{1},\dots,v_{n}\} is a basis for \mathbf{R}^{n}, and also that V is positive definite.

Lemma 1 (Lemma 2 in [13]).

Let \alpha and \gamma be defined in (31). Then V in (33) is nonsingular. Moreover, V has n-1 eigenvalues equal to \alpha, and one eigenvalue equal to 1/\sqrt{n}.

Corollary 2.

Let the conditions of Lemma 1 hold. Then \|V\|_{2}=\alpha and \|V^{-1}\|_{2}=\sqrt{n}.

Now, a set of interpolation points is generated using the regular basis. Fix x, h and \eta. Take V as in (33), and define \{x_{1},\dots,x_{n},x_{1}^{\prime},\dots,x_{n}^{\prime}\} via

x_{j}=x+hv_{j},\qquad x_{j}^{\prime}=x+\eta hv_{j},\qquad j=1,\dots,n. \qquad (35)

Note that, because \|v_{j}\|=1 for all j by (34), the points \{x_{1},\dots,x_{n}\} all lie on a circle of radius |h|, and the points \{x_{1}^{\prime},\dots,x_{n}^{\prime}\} all lie on a circle of radius |\eta h|. Figure 1 provides an illustration showing the coordinate basis and regular basis in \mathbf{R}^{2}, and how they can be used to generate the interpolation points when \eta=-1 (i.e., h^{\prime}=-h).

(a) The coordinate basis in \mathbf{R}^{2}. (b) The regular basis in \mathbf{R}^{2}.
Figure 1: Left: The standard coordinate basis \{e_{1},e_{2}\}\subset\mathbf{R}^{2} and the directions -e_{1},-e_{2}, together with the interpolation points x_{j}=x+he_{j}, j\in\{1,2\}, and x_{j}^{\prime}=x+h^{\prime}e_{j}, j\in\{1,2\}, for h^{\prime}=-h. Right: The regular basis \{v_{1},v_{2}\}\subset\mathbf{R}^{2} and the directions -v_{1},-v_{2}, together with the interpolation points x_{j}=x+hv_{j}, j\in\{1,2\}, and x_{j}^{\prime}=x+h^{\prime}v_{j}, j\in\{1,2\}, for h^{\prime}=-h.

We are now ready to state the main results of this section, which show that estimates of g(x) and d(x) can be obtained in \mathcal{O}(n) computations when using the model (13) and a regular basis (33).

Theorem 3.

Let \alpha and \gamma be defined in (31), let V be as defined in (33), let y be defined in (24), and let (17) hold. Then the solution to (23) is

g(x)=\tfrac{1}{\alpha h}(y+\tfrac{1}{n}(\sqrt{n+1}-1)(y^{T}e)e), \qquad (36)

which is an \mathcal{O}(n) computation.

Proof.

By Lemma 1, V is nonsingular. Because

1-n\gamma\overset{(31)}{=}1-n\tfrac{1}{n}(1-\tfrac{1}{\sqrt{n+1}})=\tfrac{1}{\sqrt{n+1}},\quad\text{and}\quad\gamma\sqrt{n+1}=\tfrac{1}{n}(\sqrt{n+1}-1), \qquad (37)

applying the Sherman-Morrison-Woodbury formula (28) to V in (33) gives

V^{-1}=\tfrac{1}{\alpha}(I+\tfrac{\gamma}{1-n\gamma}ee^{T})\overset{(37)}{=}\tfrac{1}{\alpha}(I+\gamma\sqrt{n+1}\;ee^{T})\overset{(37)}{=}\tfrac{1}{\alpha}(I+\tfrac{1}{n}(\sqrt{n+1}-1)ee^{T}). \qquad (38)

Hence, taking U\equiv V in (23) gives

g(x)\overset{(23)}{=}\tfrac{1}{h}V^{-1}y\overset{(38)}{=}\tfrac{1}{\alpha h}(I+\tfrac{1}{n}(\sqrt{n+1}-1)ee^{T})y=\tfrac{1}{\alpha h}(y+\tfrac{1}{n}(\sqrt{n+1}-1)(e^{T}y)e).

Lastly, (36) depends upon operations involving vectors only (inner and scalar products and vector additions) so that g(x) is an \mathcal{O}(n) computation. ∎

Corollary 4.

Let the assumptions of Theorem 3 hold, and let \eta=-1 so that y is as in (26). Setting \tilde{f}=\frac{1}{n}\sum_{i=1}^{n}(f(x+hv_{i})-f(x-hv_{i})) gives

g(x)=\tfrac{1}{2\alpha h}(\mathbf{f}-\mathbf{f}^{\prime}+(\sqrt{n+1}-1)\tilde{f}e).

Equivalently, the ith component of g(x) is

[g(x)]_{i}=\tfrac{1}{2\alpha h}(f(x+hv_{i})-f(x-hv_{i})+(\sqrt{n+1}-1)\tilde{f}). \qquad (39)

Theorem 3 and Corollary 4 show that using a regular basis in the quadratic interpolation model leads to an estimate for g(x) that is similar to the central finite differences approximation in (29). The formula (29) uses a factor (2h)^{-1}, while the formula (39) uses a factor (2\alpha h)^{-1}. Both (29) and (39) involve the difference between the function values measured along the positive and negative basis directions. However, (39) also contains a ‘correction’ term, which involves the function values measured at all the vertices. Lastly, note that Theorem 3 requires knowledge of f(x), but in the special case \eta=-1, Corollary 4 shows that f(x) is not used.
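A minimal sketch (ours, assuming NumPy; the helper name is our own) of the \mathcal{O}(n) gradient formula (39): the inputs are the 2n function values f(x\pm hv_{i}), and only vector operations are used.

import numpy as np

def regular_basis_gradient(f_plus, f_minus, h):
    # Sketch of (39): f_plus[i] = f(x + h v_i), f_minus[i] = f(x - h v_i).
    # Only vector operations are used, so the cost is O(n).
    n = f_plus.size
    alpha = np.sqrt((n + 1) / n)
    diff = f_plus - f_minus
    f_tilde = diff.mean()                       # (1/n) sum_i (f(x + h v_i) - f(x - h v_i))
    return (diff + (np.sqrt(n + 1) - 1.0) * f_tilde) / (2.0 * alpha * h)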

Before presenting the next theorem, the following parameters are defined. Let

\mu:=\alpha^{2}(1-2\gamma),\qquad\omega:=\gamma^{2}/(1-2\gamma). \qquad (40)
Remark 5.

Note that 1-2\gamma is monotonically increasing in n. Moreover, n\geq 2, and when n=2, 1-2\gamma=1-2\cdot\tfrac{1}{2}(1-\tfrac{1}{\sqrt{3}})=\tfrac{1}{\sqrt{3}}>0, so 1-2\gamma\neq 0 for any n\geq 2.

Theorem 6.

Let \alpha and \gamma be defined in (31), let \mu and \omega be defined in (40), let V be defined in (33), let z and W be defined in (25) and (15) respectively, and let (17) hold. Then the solution to (23) is

d(x)=\tfrac{2}{\mu h^{2}}\left(z-\tfrac{1}{n}(1-\mu)(e^{T}z)e\right), \qquad (41)

which is an \mathcal{O}(n) computation.

Proof.

From (32) it can be seen that

(v_{j})_{i}^{2}=(v_{j}\odot v_{j})_{i}=\begin{cases}\alpha^{2}(1-\gamma)^{2}&\text{if }j=i,\\ \alpha^{2}\gamma^{2}&\text{if }j\neq i.\end{cases} \qquad (42)

Thus,

W\overset{(42)}{=}\alpha^{2}\begin{bmatrix}(1-\gamma)^{2}&\gamma^{2}&\dots&\gamma^{2}\\ \gamma^{2}&(1-\gamma)^{2}&\dots&\gamma^{2}\\ \vdots&&\ddots&\vdots\\ \gamma^{2}&\gamma^{2}&\dots&(1-\gamma)^{2}\end{bmatrix}\overset{(40)}{=}\mu(I+\omega ee^{T}). \qquad (43)

Applying the Sherman-Morrison-Woodbury formula from (28) to W in (43) gives

W^{-1}=\tfrac{1}{\mu}\left(I-\tfrac{\omega}{1+n\omega}ee^{T}\right)=\tfrac{1}{\mu}\left(I-\tfrac{\omega n}{1+n\omega}\tfrac{1}{n}ee^{T}\right)\overset{(94)}{=}\tfrac{1}{\mu}\left(I-\tfrac{1}{n}(1-\mu)ee^{T}\right). \qquad (44)

Thus

d(x)\overset{(23)}{=}\tfrac{2}{h^{2}}W^{-1}z\overset{(44)}{=}\tfrac{2}{\mu h^{2}}\left(I-\tfrac{1}{n}(1-\mu)ee^{T}\right)z=\tfrac{2}{\mu h^{2}}\left(z-\tfrac{1}{n}(1-\mu)(e^{T}z)e\right).

Finally, (41) depends upon operations involving vectors only (inner and scalar products and vector additions) so that determining d(x) is an \mathcal{O}(n) computation. ∎

Corollary 7.

Let the assumptions of Theorem 6 hold, and let \eta=-1 so that z is as in (27). Setting \bar{f}=\tfrac{1}{n}\sum_{i=1}^{n}(f(x+hv_{i})+f(x-hv_{i})-2f(x)) gives

d(x)=\tfrac{1}{\mu h^{2}}(\delta\mathbf{f}+\delta\mathbf{f}^{\prime}+(1-\mu)\bar{f}(x)e),

or equivalently, the ith component of d(x) is

[d(x)]_{i}=\tfrac{1}{\mu h^{2}}(f(x+hv_{i})+f(x-hv_{i})-2f(x)+(1-\mu)\bar{f}(x)). \qquad (45)

Theorem 6 and Corollary 7 show that using a regular basis in the quadratic interpolation model leads to an estimate for d(x) that is similar to the central finite differences approximation in (30). The formula (30) uses a factor 1/h^{2}, whereas (45) uses a factor 1/\mu h^{2}. However, (45) also contains a ‘correction’ term, which is a kind of weighted average of the function values measured at the interpolation points.
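Similarly, a minimal sketch (ours, assuming NumPy; the helper name is our own) of the \mathcal{O}(n) diagonal Hessian formula (41), specialised to \eta=-1 so that z is built directly from the 2n function values f(x\pm hv_{i}) together with f(x):

import numpy as np

def regular_basis_diag_hessian(f_plus, f_minus, fx, h):
    # Sketch of (41) with eta = -1: z is built from (27) using the same 2n
    # function values as the gradient estimate, plus f(x).  Cost is O(n).
    n = f_plus.size
    alpha2 = (n + 1) / n
    gamma = (1.0 - 1.0 / np.sqrt(n + 1)) / n
    mu = alpha2 * (1.0 - 2.0 * gamma)
    z = 0.5 * (f_plus + f_minus - 2.0 * fx)     # (27)
    return 2.0 * (z - (1.0 - mu) * z.mean()) / (mu * h**2)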

4 Interpolation using minimal positive bases

Spendley et al. [32] and Nelder and Mead [28] carried out some of the pioneering work on derivative-free simplex-based algorithms. This work continues today with studies on properties of simplices [1, 22] as well as work on simplex algorithm development [31]. These, and many other algorithms in derivative free optimization, employ positive bases and simplices, and involve geometric operations such as expansions, rotations and contractions. In such cases, the original simplex/positive basis involves n+1 (or possibly more) points (with their corresponding function values), and if a geometric operation is performed, then this often doubles the number of interpolation points and corresponding function values that are available. As previously mentioned, function values are expensive, so it is prudent to make full use of them.

This section is motivated, in part, by the ongoing research on simplex based methods, and the following types of ideas. Consider a simplex based algorithm, where one arrives at an iteration that involves an expansion step. In such a case, one would have access to 2n+2 function values (measured at the vertices of the original and expanded simplices) as a by-product. If an approximation to the gradient or diagonal of the Hessian were readily available, then this might be useful curvature information that could be built into a simplex based algorithm to help locate the solution more quickly. Moreover, depending on the type of geometric operation involved, the vertices of the two simplices are generated from the same set of base directions, one scaled by h, and one scaled by some h^{\prime}=\eta h. Therefore, in this work we have not restricted attention to \eta=-1, to ensure the results are applicable to expansions and contractions, and not just rotations.

In this section we investigate the use of minimal positive bases (and their associated simplices) to generate interpolation points that can be used in the model (13). We will show that the results from the previous section carry over, in a natural way, when there are 2n+2 (rather than 2n) interpolation points.

4.1 Positive Bases and Simplices

This section begins with several preliminaries, (see also, for example, [10, 14, 13]).

Definition 8 (Positive Basis).

A set of vectors \{u_{1},\dots,u_{N}\}\subset\mathbf{R}^{n} is called a positive basis for \mathbf{R}^{n} if and only if the following two conditions hold.

  (i) Every vector in \mathbf{R}^{n} is a nonnegative linear combination of the members of \{u_{1},\dots,u_{N}\}, i.e., for every u\in\mathbf{R}^{n} there exist scalars 0\leq c_{i}\in\mathbf{R}, i=1,\dots,N, such that u=c_{1}u_{1}+\dots+c_{N}u_{N}.

  (ii) No proper subset of \{u_{1},\dots,u_{N}\} satisfies (i).

Definition 9 (Minimal and Maximal Positive Bases).

A positive basis with n+1 elements is called a minimal positive basis. A positive basis with 2n elements is called a maximal positive basis.

Lemma 10 (Minimal Positive Basis).

Let U=\begin{bmatrix}u_{1}&\dots&u_{n}\end{bmatrix} be a nonsingular matrix. Then the set \{u_{1},\dots,u_{n},-Ue\}\subset\mathbf{R}^{n} is a minimal positive basis for \mathbf{R}^{n}.

Lemma 10 shows that the set \{e_{1},\dots,e_{n},-e\} is a minimal positive basis for \mathbf{R}^{n}. Lemma 10 also shows that the columns of

V_{+}:=\begin{bmatrix}V&-Ve\end{bmatrix}\in\mathbf{R}^{n\times(n+1)}, \qquad (46)

where V is defined in (33), form a minimal positive basis for \mathbf{R}^{n} (recall that V is nonsingular by Lemma 1). It can be shown that (see also Lemma 3 in [13])

v_{n+1}=-Ve=-\sum_{j=1}^{n}v_{j}\overset{(33)+(46)}{=}-\frac{1}{\sqrt{n}}e. \qquad (47)

The minimal positive basis \{v_{1},\dots,v_{n+1}\} has many useful properties and it is shown in [13] that the angle between any pair of elements of the minimal positive basis is the same, while (47) shows that one of the directions is aligned with the vector -e. Henceforth, we refer to the columns of V_{+} in (46) as a regular minimal positive basis.
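The following numerical check (ours, an illustration only, assuming NumPy) constructs V_{+} for a small n and verifies that V_{+}V_{+}^{T}=\alpha^{2}I, a property used later in the proof of Theorem 18.

import numpy as np

n = 6
alpha = np.sqrt((n + 1) / n)
gamma = (1.0 - 1.0 / np.sqrt(n + 1)) / n
V = alpha * (np.eye(n) - gamma * np.ones((n, n)))        # (33)
V_plus = np.hstack([V, -V.sum(axis=1, keepdims=True)])   # last column is -Ve = -e/sqrt(n), cf. (47)
print(np.allclose(V_plus @ V_plus.T, alpha**2 * np.eye(n)))   # expected output: True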

Definition 11 (Affine independence).

A set of n+1 points x_{1},\dots,x_{n+1}\in\mathbf{R}^{n} is called affinely independent if the vectors x_{1}-x_{n+1},\dots,x_{n}-x_{n+1} are linearly independent.

Definition 12 (Simplex).

The convex hull of a set of affinely independent points \{x_{1},\dots,x_{n+1}\}\subset\mathbf{R}^{n} is called a simplex of dimension n.

Definition 13 (Regular Simplex).

A regular simplex is a simplex that is also a regular polytope.

Given a point x, and the regular minimal positive basis \{v_{1},\dots,v_{n+1}\}, the sets of points \{x_{1},\dots,x_{n+1}\} and \{x_{1}^{\prime},\dots,x_{n+1}^{\prime}\} formed via

x_{j}=x+hv_{j},\qquad x_{j}^{\prime}=x+h^{\prime}v_{j},\qquad j=1,\dots,n+1, \qquad (48)

are affinely independent. Hence, the convex hull of \{x_{1},\dots,x_{n+1}\} is a regular simplex aligned in the direction -e, and so too is the convex hull of \{x_{1}^{\prime},\dots,x_{n+1}^{\prime}\}. This explains why we have adopted the terminology ‘regular’, and refer to the columns of V as a regular basis, and the columns of V_{+} as a regular minimal positive basis; both are related to a regular simplex (see Section 3.1 in [13]).

(a) Coordinate minimal positive basis. (b) Regular minimal positive basis.
Figure 2: An illustration of minimal positive bases in \mathbf{R}^{2}. Left: the coordinate minimal positive basis \{e_{1},e_{2},-e\}\subset\mathbf{R}^{2}, the points x_{1}=x+he_{1}, x_{2}=x+he_{2}, and x_{3}=x-he, as well as the simplex formed by taking the convex hull of \{x_{1},x_{2},x_{3}\}. Right: the regular minimal positive basis \{v_{1},v_{2},v_{3}\}\subset\mathbf{R}^{2}, the points x_{1}=x+hv_{1}, x_{2}=x+hv_{2}, and x_{3}=x+hv_{3}, as well as the simplex formed by taking the convex hull of \{x_{1},x_{2},x_{3}\}.

Figure 3: A schematic of regular simplices and the results of geometric operations (expansion/contraction/rotation): (a) 0<h<h^{\prime}; (b) h^{\prime}<0<h; (c) h<h^{\prime}<0.

4.2 Interpolation using a minimal positive basis

In this section we show that estimates of the gradient and diagonal Hessian can be obtained in \mathcal{O}(n) computations when a regular minimal positive basis is used to generate the interpolation points. Hence, let \{u_{1},\dots,u_{n}\} be a basis for \mathbf{R}^{n}, so that U=\begin{bmatrix}u_{1}&\dots&u_{n}\end{bmatrix} is nonsingular. Moreover, let U_{+}=\begin{bmatrix}U&-Ue\end{bmatrix}, so that, by Lemma 10, the columns of U_{+} are a minimal positive basis for \mathbf{R}^{n}. Let W be as in (15) and define

W_{+}:=\begin{bmatrix}W&u_{n+1}\odot u_{n+1}\end{bmatrix}\in\mathbf{R}^{n\times(n+1)}. \qquad (49)

Following the same arguments as those presented in Section 2, estimates to the gradient and diagonal of the Hessian can be found by solving

y_{+}=hU_{+}^{T}g(x),\qquad\text{and}\qquad z_{+}=\tfrac{1}{2}h^{2}W_{+}^{T}d(x), \qquad (50)

where

y_{+}=\begin{bmatrix}y\\ y_{n+1}\end{bmatrix}:=-\frac{1}{\eta(1-\eta)}\left(\eta^{2}\delta\mathbf{f}_{+}-\delta\mathbf{f}_{+}^{\prime}\right)\in\mathbf{R}^{n+1} \qquad (51)

and

z_{+}=\begin{bmatrix}z\\ z_{n+1}\end{bmatrix}:=\frac{1}{\eta(1-\eta)}\left(\eta\delta\mathbf{f}_{+}-\delta\mathbf{f}_{+}^{\prime}\right)\in\mathbf{R}^{n+1}. \qquad (52)

Clearly, the matrices in (50) are not square, so that g(x) and d(x) will be computed as least squares solutions to (50) in this section.

4.3 Interpolation using the coordinate minimal positive basis

Here, take U=I, so that U_{+}=I_{+}=\begin{bmatrix}I&-e\end{bmatrix}, which corresponds to the coordinate minimal positive basis \{e_{1},\dots,e_{n},-e\}. The following results hold.

Theorem 14.

Let y_{+} be defined in (51). Then, using the coordinate minimal positive basis \{e_{1},\dots,e_{n},-e\}, the least squares solution to (50) is

g(x)=\tfrac{1}{h}(y-\tfrac{1}{n+1}(e^{T}y_{+})e).
Proof.

Note that

I_{+}I_{+}^{T}=\begin{bmatrix}I&-e\end{bmatrix}\begin{bmatrix}I\\ -e^{T}\end{bmatrix}=I+ee^{T},\qquad(I_{+}I_{+}^{T})^{-1}\overset{(28)}{=}I-\tfrac{1}{n+1}ee^{T}, \qquad (53)

and

I_{+}y_{+}=\begin{bmatrix}I&-e\end{bmatrix}\begin{bmatrix}y\\ y_{n+1}\end{bmatrix}=y-y_{n+1}e. \qquad (54)

Combining the previous two displays shows that the least squares solution to (50) is

g(x) = \tfrac{1}{h}(I_{+}I_{+}^{T})^{-1}I_{+}y_{+}
\overset{(53)+(54)}{=} \tfrac{1}{h}(I-\tfrac{1}{n+1}ee^{T})(y-y_{n+1}e)
= \tfrac{1}{h}(y-y_{n+1}e-\tfrac{1}{n+1}(e^{T}y-ny_{n+1})e)
= \tfrac{1}{h}(y-\tfrac{1}{n+1}y_{n+1}e-\tfrac{1}{n+1}(e^{T}y)e)
= \tfrac{1}{h}(y-\tfrac{1}{n+1}(e^{T}y_{+})e). ∎

Corollary 15.

Let the assumptions of Theorem 14 hold. Letting \eta=-1 and letting
\tilde{f}_{+}=\frac{1}{n+1}(f(x+he)-f(x-he)+\sum_{i=1}^{n}(f(x+he_{i})-f(x-he_{i}))) gives

g(x)=\tfrac{1}{2h}(\mathbf{f}-\mathbf{f}^{\prime}-\tilde{f}_{+}e),

or equivalently

[g(x)]_{i}=\tfrac{1}{2h}(f(x+he_{i})-f(x-he_{i})-\tilde{f}_{+}).

Corollary 15 shows that when the coordinate minimal positive basis is used in the quadratic interpolation model, the expression for g(x) is similar to the central differences formula, except that an additional shift term \tilde{f}_{+}, which involves the function values at all the interpolation points, is applied to each element.
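A minimal sketch (ours, assuming NumPy; the helper name cmpb_gradient is our own) of the Theorem 14 formula, taking the vector y_{+} from (51) as input:

import numpy as np

def cmpb_gradient(y_plus, h):
    # Sketch of Theorem 14: g(x) = (1/h) * (y - (e^T y_+)/(n+1) * e), where
    # y_plus holds the n+1 entries of y_+ from (51).  Cost is O(n).
    y = y_plus[:-1]
    return (y - y_plus.sum() / y_plus.size) / h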

Theorem 16.

Let z_{+} be defined in (52). Then, using the coordinate minimal positive basis \{e_{1},\dots,e_{n},-e\}, the least squares solution to (50) is

d(x)=\tfrac{2}{h^{2}}(z-\tfrac{1}{n+1}(e^{T}z)e+\tfrac{1}{n+1}z_{n+1}e). \qquad (55)
Proof.

For the coordinate minimal positive basis I_{+}=\begin{bmatrix}I&-e\end{bmatrix}, so W_{+}=\begin{bmatrix}I&e\end{bmatrix} because (-e)\odot(-e)=e. Hence, W_{+}W_{+}^{T}=I+ee^{T} and (W_{+}W_{+}^{T})^{-1}=I-\tfrac{1}{n+1}ee^{T}, by (53). Similarly to (54), W_{+}z_{+}=z+z_{n+1}e. Combining these shows that the least squares solution to (50) is

d(x) = \tfrac{2}{h^{2}}(W_{+}W_{+}^{T})^{-1}W_{+}z_{+}
\overset{(54)}{=} \tfrac{2}{h^{2}}(I-\tfrac{1}{n+1}ee^{T})(z+z_{n+1}e)
= \tfrac{2}{h^{2}}(z+z_{n+1}e-\tfrac{1}{n+1}(e^{T}z+nz_{n+1})e)
= \tfrac{2}{h^{2}}(z-\tfrac{1}{n+1}(e^{T}z)e+\tfrac{1}{n+1}z_{n+1}e). ∎

Corollary 17.

Let the assumptions of Theorem 16 hold. Setting \eta=-1 and letting
\bar{f}_{+}=\frac{1}{n+1}\left((f(x+he)+f(x-he)-2f(x))-\sum_{i=1}^{n}(f(x+he_{i})+f(x-he_{i})-2f(x))\right) gives

d(x)=\tfrac{1}{h^{2}}(\mathbf{f}+\mathbf{f}^{\prime}-2f(x)e-\bar{f}_{+}e),

or equivalently

[d(x)]_{i}=\tfrac{1}{h^{2}}(f(x+he_{i})+f(x-he_{i})-2f(x)-\bar{f}_{+}).

4.4 Interpolation using a regular minimal positive basis

Here, take U=V with U_{+}=V_{+}, where V is defined in (33). Furthermore, let W be as in (15) with

W_{+}:=\begin{bmatrix}W&v_{n+1}\odot v_{n+1}\end{bmatrix}\overset{(47)}{=}\begin{bmatrix}W&\tfrac{1}{n}e\end{bmatrix}\in\mathbf{R}^{n\times(n+1)}. \qquad (56)

We are now ready to present the main results of this section.

Theorem 18.

Let \alpha and \gamma be defined in (31), let V_{+} be the regular minimal positive basis in (46), let y_{+} be as in (51), and let (17) hold. Then the least squares solution to (50) is

g(x)=\tfrac{1}{\alpha h}\left(y-(\gamma(e^{T}y)+\tfrac{1}{\sqrt{n+1}}y_{n+1})e\right), \qquad (57)

which can be computed in \mathcal{O}(n).

Proof.

Using (46), it can be shown that V_{+}V_{+}^{T}=\alpha^{2}I (see also Lemma 4 in [13]), so the least squares solution to (50) is g(x)=\frac{1}{h}(V_{+}V_{+}^{T})^{-1}V_{+}y_{+}=\frac{1}{\alpha^{2}h}V_{+}y_{+}. Combining with

V_{+}y_{+} \overset{(46)}{=} \begin{bmatrix}V&-Ve\end{bmatrix}\begin{bmatrix}y\\ y_{n+1}\end{bmatrix}
\overset{(47)}{=} Vy-\tfrac{1}{\sqrt{n}}y_{n+1}e
\overset{(33)}{=} \alpha(y-\gamma(e^{T}y)e)-\tfrac{1}{\sqrt{n}}y_{n+1}e
= \alpha(y-(\gamma(e^{T}y)+\tfrac{1}{\alpha\sqrt{n}}y_{n+1})e)
\overset{(31)}{=} \alpha(y-(\gamma(e^{T}y)+\tfrac{1}{\sqrt{n+1}}y_{n+1})e),

gives (57). Finally, (57) depends upon operations involving vectors only; an \mathcal{O}(n) computation. ∎

Corollary 19.

Let the assumptions of Theorem 18 hold. Letting \eta=-1 and letting
\tilde{f}_{+}=\tfrac{1}{\sqrt{n+1}}(f(x+hv_{n+1})-f(x-hv_{n+1}))+\gamma\sum_{i=1}^{n}(f(x+hv_{i})-f(x-hv_{i})) gives

g(x)=\tfrac{1}{2\alpha h}\left(\mathbf{f}-\mathbf{f}^{\prime}-\tilde{f}_{+}e\right), \qquad (58)

or equivalently, the ith component of g(x) is

[g(x)]_{i}=\tfrac{1}{2\alpha h}\left(f(x+hv_{i})-f(x-hv_{i})-\tilde{f}_{+}\right). \qquad (59)
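A minimal sketch (ours, assuming NumPy; the helper name rmpb_gradient is our own) of the \mathcal{O}(n) formula (57), taking y_{+} from (51) as input:

import numpy as np

def rmpb_gradient(y_plus, h):
    # Sketch of (57): y_plus holds the n+1 entries of y_+ from (51) for the
    # regular minimal positive basis.  Only vector operations, so O(n).
    n = y_plus.size - 1
    alpha = np.sqrt((n + 1) / n)
    gamma = (1.0 - 1.0 / np.sqrt(n + 1)) / n
    y, y_last = y_plus[:-1], y_plus[-1]
    shift = gamma * y.sum() + y_last / np.sqrt(n + 1)
    return (y - shift) / (alpha * h)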

Before presenting the next theorem, define

\sigma:=2\omega+\omega^{2}n+\tfrac{1}{\mu^{2}n^{2}}. \qquad (60)

Theorem 20.

Let \alpha and \gamma be defined in (31), let \mu and \omega be defined in (40), and let \sigma be defined in (60). Moreover, let V_{+} be the regular minimal positive basis defined in (46), let z_{+} and W_{+} be defined in (52) and (56) respectively, and let (17) hold. Then the least squares solution to (50) is

d(x)=\tfrac{2}{\mu h^{2}}\left(z+\frac{1}{1+\sigma n}\left((\omega-\sigma)(e^{T}z)+\tfrac{1}{\mu n}z_{n+1}\right)e\right), \qquad (61)

which can be computed in \mathcal{O}(n).

Proof.

Note that

W_{+}W_{+}^{T} \overset{(49)}{=} WW^{T}+\tfrac{1}{n^{2}}ee^{T}
\overset{(43)}{=} \mu^{2}(I+\omega ee^{T})^{2}+\tfrac{1}{n^{2}}ee^{T}
= \mu^{2}(I+(2\omega+\omega^{2}n)ee^{T})+\tfrac{1}{n^{2}}ee^{T}
\overset{(60)}{=} \mu^{2}\left(I+\sigma ee^{T}\right). \qquad (62)

Applying the Sherman-Morrison-Woodbury formula (28) to (62) gives

(W_{+}W_{+}^{T})^{-1}=\tfrac{1}{\mu^{2}}\left(I-\tfrac{\sigma}{1+\sigma n}ee^{T}\right). \qquad (63)

Furthermore,

W_{+}z_{+} \overset{(49)+(52)}{=} Wz+\tfrac{1}{n}z_{n+1}e
\overset{(43)}{=} \mu\left(z+\omega(e^{T}z)e\right)+\tfrac{1}{n}z_{n+1}e
= \mu\left(z+(\omega(e^{T}z)+\tfrac{1}{n\mu}z_{n+1})e\right). \qquad (64)

The least squares solution to (50) is

d = \tfrac{2}{h^{2}}(W_{+}W_{+}^{T})^{-1}W_{+}z_{+}
\overset{(63)+(64)}{=} \tfrac{2}{\mu h^{2}}\left(I-\tfrac{\sigma}{1+\sigma n}ee^{T}\right)\left(z+(\omega(e^{T}z)+\tfrac{1}{n\mu}z_{n+1})e\right)
= \tfrac{2}{\mu h^{2}}\left(z+(\omega(e^{T}z)+\tfrac{z_{n+1}}{\mu n})e-\tfrac{\sigma((e^{T}z)+(\omega(e^{T}z)+\tfrac{z_{n+1}}{\mu n})n)}{1+\sigma n}e\right)
= \tfrac{2}{\mu h^{2}}\left(z+\tfrac{1}{1+\sigma n}\left((\omega-\sigma)(e^{T}z)+\tfrac{1}{\mu n}z_{n+1}\right)e\right).

Finally, (61) depends upon operations involving vectors only (inner and scalar products and vector additions), so that computing d(x) is an \mathcal{O}(n) computation. ∎
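A minimal sketch (ours, assuming NumPy; the helper name rmpb_diag_hessian is our own) of the \mathcal{O}(n) formula (61), taking z_{+} from (52) as input:

import numpy as np

def rmpb_diag_hessian(z_plus, h):
    # Sketch of (61): z_plus holds the n+1 entries of z_+ from (52) for the
    # regular minimal positive basis.  Cost is O(n).
    n = z_plus.size - 1
    alpha2 = (n + 1) / n
    gamma = (1.0 - 1.0 / np.sqrt(n + 1)) / n
    mu = alpha2 * (1.0 - 2.0 * gamma)
    omega = gamma**2 / (1.0 - 2.0 * gamma)
    sigma = 2.0 * omega + omega**2 * n + 1.0 / (mu**2 * n**2)
    z, z_last = z_plus[:-1], z_plus[-1]
    shift = ((omega - sigma) * z.sum() + z_last / (mu * n)) / (1.0 + sigma * n)
    return 2.0 * (z + shift) / (mu * h**2)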

Corollary 21.

Let the assumptions of Theorem 20 hold. Setting \eta=-1 and letting \bar{f}_{+}=\tfrac{1}{\mu n}(f(x+hv_{n+1})+f(x-hv_{n+1})-2f(x))+(\omega-\sigma)\sum_{i=1}^{n}(f(x+he_{i})+f(x-he_{i})-2f(x)) gives

d(x)=\tfrac{1}{h^{2}}(\mathbf{f}+\mathbf{f}^{\prime}-2f(x)e-\tfrac{1}{1+\sigma n}\bar{f}_{+}e).

4.5 Error bounds for the interpolation model

The following is a standard error bound; the proof is included for completeness.

Theorem 22.

Let \{u_{1},\dots,u_{n+1}\} be a minimal positive basis for \mathbf{R}^{n} and let h\in\mathbf{R}. Suppose points \{x_{1},\dots,x_{n+1},x_{1}^{\prime},\dots,x_{n+1}^{\prime}\} are constructed via

x_{j}=x+hu_{j},\qquad x_{j}^{\prime}=x-hu_{j},\qquad j=1,\dots,n+1. \qquad (65)

Let U_{+}=\begin{bmatrix}u_{1}&\dots&u_{n+1}\end{bmatrix} and suppose that f is twice continuously differentiable and that \nabla^{2}f(x) is M-Lipschitz continuous. Then, the least squares solution g(x) to (50) satisfies

\|g(x)-\nabla f(x)\|_{2}\leq\tfrac{1}{6}Mh^{2}\|U_{+}^{\dagger}\|_{2}\sqrt{n+1}. \qquad (66)
Proof.

We begin by noticing that

(U_{+}U_{+}^{T})(g(x)-\nabla f(x)) \overset{(50)}{=} \tfrac{1}{h}U_{+}(y_{+}-hU_{+}^{T}\nabla f(x)) \overset{(51)}{=} \tfrac{1}{h}U_{+}(\tfrac{1}{2}(\delta\mathbf{f}_{+}-\delta\mathbf{f}_{+}^{\prime})-hU_{+}^{T}\nabla f(x)). \qquad (67)

By (6) and (65), the jth element of (67) is

[\tfrac{1}{2}(\delta\mathbf{f}_{+}-\delta\mathbf{f}_{+}^{\prime})-hU_{+}^{T}\nabla f(x)]_{j}=\tfrac{1}{2}\big{(}(f_{j}-f(x))-(f_{j}^{\prime}-f(x))\big{)}-(x_{j}-x)^{T}\nabla f(x)=\tfrac{1}{2}\big{(}f_{j}-f_{j}^{\prime}\big{)}-(x_{j}-x)^{T}\nabla f(x). \qquad (68)

The integral form of the Mean Value Theorem provides the identities

f_{j}-f(x)=(x_{j}-x)^{T}\int_{0}^{1}\nabla f(x+t(x_{j}-x))dt \qquad (69)
f_{j}^{\prime}-f(x)=-(x_{j}-x)^{T}\int_{0}^{1}\nabla f(x-t(x_{j}-x))dt, \qquad (70)

because x_{j}-x=x-x_{j}^{\prime}=-(x_{j}^{\prime}-x), by (65). By [30, p.14], because f is twice continuously differentiable, for p\in\mathbf{R}^{n},

\nabla f(x+p)-\nabla f(x)=\int_{0}^{1}\nabla^{2}f(x+\tau p)p\;d\tau. \qquad (71)

Combining (69), (70) and (71), and subsequently subtracting (x_{j}-x)^{T}\nabla f(x), shows that for each j=1,\dots,n:

\tfrac{1}{2}(f_{j}-f_{j}^{\prime})-(x_{j}-x)^{T}\nabla f(x)
= \tfrac{1}{2}(x_{j}-x)^{T}\int_{0}^{1}\Big{(}\nabla f(x+t(x_{j}-x))-\nabla f(x)\Big{)}+\Big{(}\nabla f(x-t(x_{j}-x))-\nabla f(x)\Big{)}dt
\overset{(71)}{=} \tfrac{1}{2}(x_{j}-x)^{T}\int_{0}^{1}\int_{0}^{1}\nabla^{2}f(x+\tau t(x_{j}-x))t(x_{j}-x)-\nabla^{2}f(x-\tau t(x_{j}-x))t(x_{j}-x)\,dt\,d\tau
\leq \tfrac{1}{2}\|x_{j}-x\|\int_{0}^{1}\int_{0}^{1}\|\nabla^{2}f(x+\tau t(x_{j}-x))t(x_{j}-x)-\nabla^{2}f(x-\tau t(x_{j}-x))t(x_{j}-x)\|\,dt\,d\tau
\leq \tfrac{1}{2}\|x_{j}-x\|\int_{0}^{1}\int_{0}^{1}\|\nabla^{2}f(x+\tau t(x_{j}-x))-\nabla^{2}f(x-\tau t(x_{j}-x))\|\|t(x_{j}-x)\|\,dt\,d\tau
\leq \|x_{j}-x\|^{2}\int_{0}^{1}\int_{0}^{1}\|\nabla^{2}f(x+\tau t(x_{j}-x))-\nabla^{2}f(x-\tau t(x_{j}-x))\|t\,dt\,d\tau
\leq M\|x_{j}-x\|^{3}\int_{0}^{1}\int_{0}^{1}\tau t^{2}\,dt\,d\tau
= \tfrac{1}{6}Mh^{3}.

Hence, taking the 2-norm of (67), using the calculation above, and recalling the standard norm inequality \|\cdot\|_{2}\leq\sqrt{n+1}\|\cdot\|_{\infty} (for vectors in \mathbf{R}^{n+1}), gives (66). ∎

4.6 Summary for quadratic interpolation models

The following tables summarize our results for diagonal quadratic interpolation models. The formulae for g(x) are expressed in terms of y_{+} defined in (51). The error bounds give \kappa, where \|g(x)-\nabla f(x)\|\leq\tfrac{1}{6}Mh^{2}\kappa, which holds when \eta=-1.

Basis | g(x) | Error \kappa | # Samples | Theorem
CB | \tfrac{1}{h}y | \sqrt{n} | 2n | -
RB | \tfrac{1}{\alpha h}(y+\gamma\sqrt{n+1}(y^{T}e)e) | n | 2n | Thm 3
CMPB | \tfrac{1}{h}(y-\tfrac{1}{n+1}(e^{T}y_{+})e) | \sqrt{n+1} | 2n+2 | Thm 14
RMPB | \tfrac{1}{\alpha h}\left(y-(\gamma(e^{T}y)+\tfrac{1}{\sqrt{n+1}}y_{n+1})e\right) | \sqrt{n} | 2n+2 | Thm 18
Table 1: CB = Coordinate Basis, RB = Regular Basis, CMPB = Coordinate Minimal Positive Basis, RMPB = Regular Minimal Positive Basis. Quadratic model summary for the gradients.
d(x)d(x) # Samples Theorem
CB 2h2z\tfrac{2}{h^{2}}z 2n2n
RB 2μh2(z1n(1μ)(eTz)e)\tfrac{2}{\mu h^{2}}\left(z-\tfrac{1}{n}(1-\mu)(e^{T}z)e\right) 2n2n Thm 6
CMPB 2h2(z1n+1(eTz)e+1n+1zn+1e)\tfrac{2}{h^{2}}(z-\tfrac{1}{n+1}(e^{T}z)e+\tfrac{1}{n+1}z_{n+1}e) 2n+22n+2 Thm 16
RMPB 2μh2(z+11+σn((ωσ)(eTz)+1μnzn+1)e)\tfrac{2}{\mu h^{2}}\left(z+\frac{1}{1+\sigma n}\left((\omega-\sigma)(e^{T}z)+\tfrac{1}{\mu n}z_{n+1}\right)e\right) 2n+22n+2 Thm 20
Table 2: CB = Coordinate Basis, RB = Regular Basis, CMPB = Coordinate Minimal Positive Basis, RMPB = Regular Minimal Positive Basis. Quadratic model summary for the diagonal of the Hessian.

We conclude this section by remarking that, in the special case when n=2, if a minimal positive basis is used to generate the points \{x_{1},x_{2},x_{3},x_{1}^{\prime},x_{2}^{\prime},x_{3}^{\prime}\}, then there is enough information to uniquely determine all entries of H(x) (recall that H(x)\approx\nabla^{2}f(x)), because 2n+2=6=(n+1)(n+2)/2 function values are available, i.e., one can approximate \nabla^{2}f(x) rather than only {\rm diag}(\nabla^{2}f(x)). The algebra is straightforward, so it is omitted for brevity.
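To make the summary concrete, the \mathcal{O}(n) closed-form estimates in the RB rows of Tables 1 and 2 can be checked against a direct solve of the corresponding interpolation systems. The following MATLAB sketch does this on Rosenbrock's function; it is an illustration only (not the code used for the experiments in Section 6), and it assumes that \alpha=\sqrt{(n+1)/n} and \gamma=(1-1/\sqrt{n+1})/n (consistent with the n=2 values quoted in Section 6.1), and that y and z collect the central and second differences \tfrac{1}{2}(\delta\mathbf{f}-\delta\mathbf{f}^{\prime}) and \tfrac{1}{2}(\delta\mathbf{f}+\delta\mathbf{f}^{\prime}) along the regular basis directions.

% Sanity check of the RB rows of Tables 1 and 2 (sketch only).
% Assumes alpha = sqrt((n+1)/n), gamma = (1 - 1/sqrt(n+1))/n, and that
% y and z hold central/second differences along the regular basis directions.
n = 2; h = 1e-3;
alpha = sqrt((n+1)/n);
gamma = (1 - 1/sqrt(n+1))/n;
mu    = alpha^2*(1 - 2*gamma);
e = ones(n,1);
V = alpha*(eye(n) - gamma*(e*e'));                 % regular basis, columns v_1,...,v_n
f = @(v) (1 - v(1))^2 + 100*(v(2) - v(1)^2)^2;     % Rosenbrock's function (76)
x = [1.1; 1.1^2 + 1e-5];                           % the point (79)
y = zeros(n,1); z = zeros(n,1);
for j = 1:n
    fp = f(x + h*V(:,j)); fm = f(x - h*V(:,j));
    y(j) = (fp - fm)/2;                            % central difference along v_j
    z(j) = (fp + fm)/2 - f(x);                     % second difference along v_j
end
g_closed = (y + gamma*sqrt(n+1)*(e'*y)*e)/(alpha*h);    % Table 1, RB row
g_direct = (V'\y)/h;                                    % direct solve of V'*g = y/h
d_closed = (2/(mu*h^2))*(z - ((1-mu)/n)*(e'*z)*e);      % Table 2, RB row
d_direct = (2/h^2)*(((V.^2)')\z);                       % direct solve of (V.^2)'*d = (2/h^2)*z
[norm(g_closed - g_direct), norm(d_closed - d_direct)]  % both should be at rounding level

Both differences printed on the last line should be at the level of rounding error, which is the point of the \mathcal{O}(n) formulae: the structured interpolation systems need never be formed or factored.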

5 Linear Interpolation Models

While quadratic interpolation models lead to accurate approximations to the gradient and diagonal of the Hessian, they require at least 2n2n function evaluations (and more if the full Hessian is desired). For some applications, this cost is too high, and a linear interpolation model is preferred. In this case, one can obtain estimates of the gradient in 𝒪(n)\mathcal{O}(n) computations, and this section details how. As usual, letting g(x)g(x) denote an approximation to the gradient f(x)\nabla f(x), a linear model is

m(y)=f(x)+(yx)Tg(x).m(y)=f(x)+(y-x)^{T}g(x). (72)

Estimates of the gradient g(x) can be obtained easily from a linear model, although the approximations from such a model (using n or n+1 function evaluations, depending on the interpolation points) are only \mathcal{O}(h) accurate (in the previous cases, with 2n or 2n+2 function evaluations, the approximations are \mathcal{O}(h^{2}) accurate).

The interpolation conditions are simply f(x+huj)=m(x+huj)f(x+hu_{j})=m(x+hu_{j}) j\forall j. If there are nn interpolation points (formed from a unit basis for 𝐑n)\mathbf{R}^{n}) then g(x)g(x) is the solution to

δ𝐟=hUTg(x),\delta\mathbf{f}=hU^{T}g(x), (73)

while if there are n+1 interpolation points (i.e., formed from a minimal positive basis for \mathbf{R}^{n}), then g(x) is the solution to

δ𝐟+=hU+Tg(x).\delta\mathbf{f}_{+}=hU_{+}^{T}g(x). (74)

An explicit solution to (73) is known for a coordinate basis, and solutions to (74) are known for the coordinate minimal positive basis and a regular minimal positive basis.

An explicit solution to (73) for a regular basis (i.e., V in (33)) is as follows. By (73), g(x)=\frac{1}{h}V^{-1}\delta\mathbf{f}. Now, following the same arguments as in the proof of Theorem 3 gives g(x)=\frac{1}{\alpha h}(\delta\mathbf{f}+\gamma\sqrt{n+1}(e^{T}\delta\mathbf{f})e), or componentwise, for i=1,\dots,n,

[g(x)]_{i}=\tfrac{1}{\alpha h}\left(f(x+hv_{i})-f(x)+\gamma\sqrt{n+1}(e^{T}\delta\mathbf{f})\right).
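As an illustration only (a sketch, not the experimental code, and again assuming \alpha=\sqrt{(n+1)/n} and \gamma=(1-1/\sqrt{n+1})/n), the componentwise formula above can be evaluated with n forward evaluations along the regular basis:

% Sketch: linear-model (O(h) accurate) gradient estimate from the n forward
% points x + h*v_i, assuming alpha = sqrt((n+1)/n), gamma = (1 - 1/sqrt(n+1))/n.
n = 2; h = 1e-3;
alpha = sqrt((n+1)/n); gamma = (1 - 1/sqrt(n+1))/n; e = ones(n,1);
V = alpha*(eye(n) - gamma*(e*e'));                    % regular basis V in (33)
f = @(v) (1 - v(1))^2 + 100*(v(2) - v(1)^2)^2;        % Rosenbrock's function
x = [1.1; 1.1^2 + 1e-5];
df = zeros(n,1);
for i = 1:n
    df(i) = f(x + h*V(:,i)) - f(x);                   % delta f
end
g_lin = (df + gamma*sqrt(n+1)*(e'*df)*e)/(alpha*h);   % O(n) closed form above

The same vector is obtained from (V'\df)/h, but the closed form avoids the \mathcal{O}(n^{3}) solve.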

The following theorem provides the standard bound for linear interpolation models.

Theorem 23.

Let \{u_{1},\dots,u_{n+1}\} be a minimal positive basis for \mathbf{R}^{n} and let h\in\mathbf{R}. Suppose points \{x_{1},\dots,x_{n+1},x_{1}^{\prime},\dots,x_{n+1}^{\prime}\} are constructed via x_{j}=x+hu_{j}, x_{j}^{\prime}=x+\eta hu_{j} for j=1,\dots,n+1. Let U_{+}=\begin{bmatrix}u_{1},\dots,u_{n+1}\end{bmatrix} and suppose that f is continuously differentiable with an L-Lipschitz continuous gradient. Then, the least squares solution g(x) to (74) satisfies

g(x)f(x)212LhU+2n+1.\displaystyle\|g(x)-\nabla f(x)\|_{2}\leq\tfrac{1}{2}Lh\|U_{+}^{\dagger}\|_{2}\sqrt{n+1}. (75)

It is clear that if nn points are used to generate the interpolation model, then the bound above becomes g(x)f(x)212LhU12n\|g(x)-\nabla f(x)\|_{2}\leq\tfrac{1}{2}Lh\|U^{-1}\|_{2}\sqrt{n}.

The results for linear interpolation models are summarized in Table 3. The error bounds give \kappa_{L}, where \|g(x)-\nabla f(x)\|\leq\tfrac{1}{2}Lh\kappa_{L}, which holds when \eta=-1.

g(x) Error κL\kappa_{L} # Samples Ref
CB 1hδ𝐟\frac{1}{h}\delta\mathbf{f} n\sqrt{n} nn
RB 1αh(δ𝐟+γn+1(eTδ𝐟)e)\frac{1}{\alpha h}(\delta\mathbf{f}+\gamma\sqrt{n+1}(e^{T}\delta\mathbf{f})e) nn nn This work
CMPB 1h(𝐟1n+1(eT𝐟+)e)\tfrac{1}{h}(\mathbf{f}-\tfrac{1}{n+1}(e^{T}\mathbf{f}_{+})e) n+1\sqrt{n+1} n+1n+1 [15]
RMPB 1αh(𝐟(γ(eT𝐟)1n+1fn+1)e)\tfrac{1}{\alpha h}\left(\mathbf{f}-(\gamma(e^{T}\mathbf{f})-\tfrac{1}{\sqrt{n+1}}f_{n+1})e\right) n\sqrt{n} n+1n+1 [13]
Table 3: CB = Coordinate Basis, RB = Regular Basis, CMPB = Coordinate Minimal Positive Basis, RMPB = Regular Minimal Positive Basis. Linear model summary.

6 Numerical Experiments

Here we present numerical examples to verify the results of this paper, and to provide a concrete application to show that the results have practical use. All experiments are performed using MATLAB (version 2018a).

6.1 Derivative Computations

Here we depart from our usual notation and let y𝐑2y\in\mathbf{R}^{2} with components y=[y1y2]Ty=\begin{bmatrix}y_{1}&y_{2}\end{bmatrix}^{T} so that Rosenbrock’s function can be written as

f(y1,y2)=(1y1)2+100(y2y12)2.\displaystyle f(y_{1},y_{2})=(1-y_{1})^{2}+100(y_{2}-y_{1}^{2})^{2}. (76)

The gradient of (76) is

f(y1,y2)=[2(1y1)400y1(y2y12)200(y2y12)]\displaystyle\nabla f(y_{1},y_{2})=\begin{bmatrix}-2(1-y_{1})-400y_{1}(y_{2}-y_{1}^{2})\\ 200(y_{2}-y_{1}^{2})\end{bmatrix} (77)

and the Hessian of (76) is

2f(y1,y2)=[2400y2+1200y12400y1400y1200].\displaystyle\nabla^{2}f(y_{1},y_{2})=\begin{bmatrix}2-400y_{2}+1200y_{1}^{2}&-400y_{1}\\ -400y_{1}&200\end{bmatrix}. (78)

Henceforth, we return to our usual notation.

For the first numerical experiment we simply use the results presented in this work to compute approximations to the gradient and the diagonal of the Hessian of Rosenbrock's function at two different points. The points are chosen to be ones where an accurate gradient is needed to make algorithmic progress.

Point near valley floor

The first point is

x=[1.11.12+105],x=\begin{bmatrix}1.1\\ 1.1^{2}+10^{-5}\end{bmatrix}, (79)

which was chosen because it is close to the “valley floor” for Rosenbrock’s function where a good approximation to the gradient is required to make algorithmic progress.

The aligned regular simplex is constructed using the approach previously presented. Here, n=2n=2 so that α=3/2\alpha=\sqrt{3/2} and γ=12(113)\gamma=\frac{1}{2}(1-\frac{1}{\sqrt{3}}) by (31). Recalling that V=α(IγeeT)V=\alpha(I-\gamma ee^{T}) (33) gives (to 4d.p.)

V+=[v1v2v3]=[0.96590.25880.70710.25880.96590.7071]\displaystyle V_{+}=\begin{bmatrix}v_{1}&v_{2}&v_{3}\end{bmatrix}=\begin{bmatrix}0.9659&-0.2588&-0.7071\\ -0.2588&0.9659&-0.7071\end{bmatrix} (80)

For this experiment h=103h=10^{-3} was used. For the regular basis (RB), the points were generated as xj=x+hvjx_{j}=x+hv_{j}, xj=xhvjx_{j}^{\prime}=x-hv_{j} for j=1,2j=1,2, while the regular minimal positive basis (RMPB) used xj=x+hvjx_{j}=x+hv_{j}, xj=xhvjx_{j}^{\prime}=x-hv_{j} for j=1,2,3j=1,2,3. For the coordinate basis (CB), the points were generated as xj=x+hejx_{j}=x+he_{j}, xj=xhejx_{j}^{\prime}=x-he_{j} for j=1,2j=1,2, while the coordinate minimal positive basis used the previous points as well as x3=xhex_{3}=x-he and x3=x+hex_{3}^{\prime}=x+he.

Substituting (79) into (77) and (78) shows that the analytic gradient and Hessian are

f(x)=[0.195599990.00200000],diag(2f(x))=102×[9.699960002.00000000]\displaystyle\nabla f(x)=\begin{bmatrix}0.19559999\\ 0.00200000\end{bmatrix},\qquad{\rm diag}(\nabla^{2}f(x))=10^{2}\times\begin{bmatrix}9.69996000\\ 2.00000000\end{bmatrix} (81)

Approximations to the gradient and Hessian were built using the theorems presented earlier in this work, and are stated in Table 4. Also stated in the table is

ϵg=g(x)f(x)2,ϵd=d(x)diag(2f(x))2.\displaystyle\epsilon_{g}=\|g(x)-\nabla f(x)\|_{2},\qquad\epsilon_{d}=\|d(x)-{\rm diag}(\nabla^{2}f(x))\|_{2}.

That is, \epsilon_{g} is the computed error between the true gradient reported in (81) and the computed approximation; similarly, \epsilon_{d} is the computed error between the true Hessian diagonal reported in (81) and the computed approximation. Note that, for the given point (79), the theoretical error bound for the gradient approximations is \tfrac{1}{6}Mh^{2}\|U_{+}^{\dagger}\|_{2}\sqrt{n+1}, where the Lipschitz constant is estimated as M=2.3081\times 10^{3}, so that \tfrac{1}{6}Mh^{2}\sqrt{n}=5.4403\times 10^{-4}.

g(x)g(x) ϵg\epsilon_{g} d(x)d(x) ϵd\epsilon_{d}
CB [0.196039990.00200000]\begin{bmatrix}0.19603999\\ 0.00200000\end{bmatrix} 4.39×1044.39\times 10^{-4} 102×[9.699961991.99999999]10^{2}\times\begin{bmatrix}9.69996199\\ 1.99999999\end{bmatrix} 1.99×1041.99\times 10^{-4}
RB [0.196089990.00211000]\begin{bmatrix}0.19608999\\ 0.00211000\end{bmatrix} 5.02×1045.02\times 10^{-4} 102×[11.899961974.19999997]10^{2}\times\begin{bmatrix}11.89996197\\ 4.19999997\end{bmatrix} 3.11×1023.11\times 10^{2}
CMPB [0.195973330.00193333]\begin{bmatrix}0.19597333\\ 0.00193333\end{bmatrix} 3.79×1043.79\times 10^{-4} 102×[6.766628670.93333333]10^{2}\times\begin{bmatrix}6.76662867\\ -0.93333333\end{bmatrix} 4.15×1024.15\times 10^{2}
RMPB [0.195929990.00195000]\begin{bmatrix}0.19592999\\ 0.00195000\end{bmatrix} 3.33×1043.33\times 10^{-4} 102×[9.699961751.99999975]10^{2}\times\begin{bmatrix}9.69996175\\ 1.99999975\end{bmatrix} 1.77×1041.77\times 10^{-4}
Table 4: The computed gradient and diagonal Hessian approximations using the Coordinate Basis, the Regular Basis, the Coordinate Minimal Positive Basis and the Regular Minimal Positive Basis. The computed error in each approximation is also reported.

Table 4 shows that the computed gradient approximations are all close to the true solution (81). The error in the gradient for each approximation is \approx 10^{-4}, which is below the theoretical bound, as expected. For a coordinate basis and a regular minimal positive basis, the approximation to the diagonal of the Hessian is also good, both having error \approx 10^{-4}. However, for a regular basis and a coordinate minimal positive basis, the approximations d(x) are poor, both having large error \approx 10^{2}. Upon further inspection, one notices that d(x) for a regular basis is essentially just a shifted version of the approximation for a regular minimal positive basis (subtracting \approx 2.2\times 10^{2} from each component of the regular basis approximation gives the approximation for a regular minimal positive basis). A similar comment holds for the approximations for a coordinate basis and for a coordinate minimal positive basis. This is consistent with the theorems presented in this work. For example, comparing Theorems 6 and 20, both estimates are the vector z, with each component shifted by a fixed amount, with the fixed amount differing for each theorem.
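Because the CB estimates with \eta=-1 coincide with central and second finite differences, the CB row of Table 4 can be reproduced in a few lines; the following sketch is illustrative only.

% Sketch: reproduce the CB row of Table 4 with central and second differences.
n = 2; h = 1e-3;
f = @(v) (1 - v(1))^2 + 100*(v(2) - v(1)^2)^2;   % Rosenbrock's function (76)
x = [1.1; 1.1^2 + 1e-5];                         % the point (79)
g = zeros(n,1); d = zeros(n,1); I = eye(n);
for i = 1:n
    fp = f(x + h*I(:,i)); fm = f(x - h*I(:,i));
    g(i) = (fp - fm)/(2*h);                      % central difference gradient estimate
    d(i) = (fp - 2*f(x) + fm)/h^2;               % second difference curvature estimate
end
% Comparing with (81), g should be close to [0.1960; 0.0020]
% and d close to [969.996; 200.000], as in Table 4.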

Point near the solution

The previous experiment is now repeated at a point x near the solution x^{*}=[1,1]^{T}. Let

x=[0.90.81].x=\begin{bmatrix}0.9\\ 0.81\end{bmatrix}. (82)

When close to the solution, it is appropriate to choose hh to be small, so here we set h=106h=10^{-6}. Substituting (82) into (77) and (78) gives

f(x)=[0.200000000],diag(2f(x))=102×[6.500000002.00000000].\displaystyle\nabla f(x)=\begin{bmatrix}-0.20000000\\ 0\end{bmatrix},\qquad{\rm diag}(\nabla^{2}f(x))=10^{2}\times\begin{bmatrix}6.50000000\\ 2.00000000\end{bmatrix}. (83)

For this experiment with the point (82), the theoretical error bound for the gradient approximations is again 16Mh2U+2n+1\tfrac{1}{6}Mh^{2}\|U_{+}^{{\dagger}}\|_{2}\sqrt{n+1}, where the Lipschitz constant is estimated as M=1.8466×103M=1.8466\times 10^{3}, so that 16Mh2n=4.3526×1010\tfrac{1}{6}Mh^{2}\sqrt{n}=4.3526\times 10^{-10}.

g(x)g(x) ϵg\epsilon_{g} d(x)d(x) ϵd\epsilon_{d}
CB [0.199999990]\begin{bmatrix}-0.19999999\\ 0\end{bmatrix} 3.54×10103.54\times 10^{-10} 102×[6.499999981.99999999]10^{2}\times\begin{bmatrix}6.49999998\\ 1.99999999\end{bmatrix} 1.91×1061.91\times 10^{-6}
RB [0.199999990.00000000]\begin{bmatrix}-0.19999999\\ 0.00000000\end{bmatrix} 4.09×10104.09\times 10^{-10} 102×[8.300000003.80000003]10^{2}\times\begin{bmatrix}8.30000000\\ 3.80000003\end{bmatrix} 2.55×1022.55\times 10^{2}
CMPB [0.199999990.00000000]\begin{bmatrix}-0.19999999\\ -0.00000000\end{bmatrix} 2.95×10102.95\times 10^{-10} 102×[4.099999990.39999999]10^{2}\times\begin{bmatrix}4.09999999\\ -0.39999999\end{bmatrix} 3.39×1023.39\times 10^{2}
RMPB [0.199999990.00000000]\begin{bmatrix}-0.19999999\\ -0.00000000\end{bmatrix} 2.67×10102.67\times 10^{-10} 102×[6.499999992.00000001]10^{2}\times\begin{bmatrix}6.49999999\\ 2.00000001\end{bmatrix} 1.69×1061.69\times 10^{-6}
Table 5: The computed gradient and diagonal Hessian approximations using the Coordinate Basis, the Regular Basis, the Coordinate Minimal Positive Basis and the Regular Minimal Positive Basis. The computed error in each approximation is also reported.

Table 5 again shows that the computed gradient approximations are all close to the true solution (83), with the errors in the gradient approximations being \approx 10^{-10}, all of which are below the theoretical bound. For a coordinate basis and a regular minimal positive basis, the approximation to the diagonal of the Hessian is also good, both having error \approx 10^{-6}. However, for a regular basis and a coordinate minimal positive basis, the approximations d(x) are again poor, both having large error \approx 10^{2}. Again, d(x) for a coordinate minimal positive basis is a shifted version of the approximation for a coordinate basis, and similarly for the regular basis and the regular minimal positive basis.

6.2 Derivative estimates within a frame based preconditioned conjugate gradients algorithm

The purpose of this section is to give an example of how the gradient estimates presented in this work could be used within an existing derivative free optimization algorithm. The algorithm employed here is the frame based preconditioned conjugate gradients method from [12] (henceforth referred to as FB-PCG), which can be applied to problems of the form (1) where ff is a smooth function whose derivatives are unavailable. The algorithm uses a Conjugate Gradients (CG) type inner loop, so approximations to the gradient are needed. It is well known that the practical behaviour of CG often improves when an appropriate preconditioner is used. Thus, the inexpensive estimates g(x)g(x) and D(x)D(x) are of particular use here; the D(x)D(x) developed in this work is an inexpensive approximation to the Hessian, and being diagonal, is a convenient preconditioner. The algorithm in [12] employs frames as a globalisation strategy to ensure convergence.

Here, the algorithm is implemented exactly the same way as in [12]. The only difference is the way in which the gradient and pure second derivatives are computed. In [12], g(x)g(x) and D(x)D(x) are estimated using finite differences. Here we set η=1\eta=-1, and run the algorithm using the four derivative variants presented previously in this work: (i) a Coordinate Basis (CB), which is equivalent to finite difference approximations because η=1\eta=-1, (ii) a Regular Basis (RB), (iii) a Coordinate Minimal Positive Basis (CMPB), and (iv) a Regular Minimal Positive Basis (RMPB). It is expected that the practical behaviour of the algorithm will be similar in each instance, because all gradient approximations used are 𝒪(h2)\mathcal{O}(h^{2}) accurate. However, we comment that for a coordinate minimal positive basis, the approximation d(x)d(x) was computed using finite differences, to ensure the best possible algorithmic performance.

Algorithm 1 Frame-Based Preconditioned Conjugate Gradients (FB-PCG)
1:  Initialization: Set k=1k=1, j=nj=n, h=θ0=1h=\theta_{0}=1, H=IH=I, N>0N>0, τmin>0\tau_{\min}>0, hmin>0h_{\min}>0, ν>1\nu>1.
2:  while Stopping conditions do not hold do
3:     Calculate the function values at the frame points, and form the gradient estimate g(xk)g(x_{k}).
4:     if j=1j=1 then
5:        Form the pure second derivative estimate D(xk)D(x_{k}).
6:     end if
7:     Check the stopping conditions.
8:     Calculate the new search direction pk+1=Hgk+1+βkpkp_{k+1}=-Hg_{k+1}+\beta_{k}p_{k}, where
βk=max{0,gkTH(gk+1gk)gkTHgk},Hii=1max{Di,104}\displaystyle\beta_{k}=\max\{0,\frac{g_{k}^{T}H(g_{k+1}-g_{k})}{g_{k}^{T}Hg_{k}}\},\quad H_{ii}=\frac{1}{\max\{D_{i},10^{-4}\}} (84)
9:     Execute line search: find \theta_{k}=\arg\min_{\theta}f(x_{k}+\theta h_{k}p_{k}/\|p_{k}\|)
10:     if j=1j=1 then
11:        update HH, set xk+1x_{k+1} to be the point with lowest known function value, set j=n+3j=n+3
12:     else
13:        decrease jj by 1 and set xk+1=xk+θhkpk/pkx_{k+1}=x_{k}+\theta h_{k}p_{k}/\|p_{k}\|.
14:     end if
15:     If frame is quasi-minimal set hk+1=hk/λh_{k+1}=h_{k}/\lambda; else hk+1=hkh_{k+1}=h_{k}. Update k=k+1k=k+1.
16:  end while
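For concreteness, the update (84) in step 8 of Algorithm 1 amounts to the following lines (a sketch only; gk and gk1 denote the gradient estimates at x_{k} and x_{k+1}, pk the current search direction, and D the vector of pure second derivative estimates, all assumed to be column vectors).

% Sketch of step 8 / update (84): diagonal preconditioner and restarted beta.
Hdiag = 1 ./ max(D, 1e-4);                                      % H_ii = 1/max{D_i,1e-4}
beta  = max(0, (gk'*(Hdiag.*(gk1 - gk))) / (gk'*(Hdiag.*gk)));  % beta_k in (84)
pk1   = -(Hdiag.*gk1) + beta*pk;                                % p_{k+1} in step 8

Because H is diagonal, applying the preconditioner costs only \mathcal{O}(n) operations per iteration.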

The algorithm is guaranteed to converge, as confirmed by the following theorem.

Theorem 24 (Theorem 1 in [12]).

If the following conditions are satisfied: (1) the sequence of iterates is bounded; (2) ff is continuously differentiable with a locally Lipschitz gradient; and (3) hk0h_{k}\to 0 as kk\to\infty, then the subsequence of quasi-minimal frame centers is infinite, and all cluster points of this subsequence are stationary points of ff.

Remark 25.

The algorithm of [23] is based upon the FB-PCG algorithm in [12], but there are several differences. In [23], a single minimal positive basis is employed at each iteration (i.e., a linear model is used, and approximations to the second derivatives are unavailable). Also, for the algorithm in [23], at each iteration, the new point xk+1x_{k+1} is always set to be the point with the best (lowest) function value encountered. On the other hand, the Algorithm 1 from [12] performs n+3n+3 iterations of PCG, to maintain the positive properties of the PCG algorithm. After n+3n+3 iterations there is a reset, where the PCG process begins again from the point with the lowest function value over the preceding n+3n+3 iterations. Because [23] employs only 𝒪(h)\mathcal{O}(h) approximations to g(x)g(x) and does not use approximations d(x)d(x), we do not consider it here.

The problems tested are taken from the MGH test set in [27] and were chosen to allow comparison with the results in [12] and [23]. The problems are stated in Table 6, and the results of applying the FB-PCG algorithm (with each of the four derivative variants) to the test problems are given in Tables 7 and 8. In these tables, 'nf' denotes the number of function evaluations used, 'fmin' the best function value found, '\|g(x)\|' the norm of the final gradient estimate, 'h' the final frame size, 'itns' the number of iterations, and 'qmfs' the number of quasi-minimal frames encountered. As suggested in Section 2.2 of [26], each algorithm was terminated if it reached the maximum budget of 1300 function evaluations.

No. Function Name nn No. Function Name nn
1 Rosenbrock 2 12 Box 3-D 3
2 Freudenstein and Roth 2 13 Powell singular 4
3 Powell badly scaled 2 14 Woods function 4
4 Brown badly scaled 2 15 Kowalski and Osborne 4
5 Beale 2 16 Brown and Dennis 4
6 Jennrich and Sampson 2 17 Osborne 1 5
7 Helical valley function 3 18 Bigg’s exponential 6
8 Bard 3 19 Osborne 2 11
9 Gaussian function 3 20 Trigonometric 2
10 Meyer function 3 21 Brown almost linear 2
11 Gulf research & development 3 22 Broyden tridiagonal 20
Table 6: The test problems used in the numerical experiments.
Prob. grad type nf fmin \|g(x)\| h itns qmfs
1 CB 300 5.2336e-11 7.6674e-06 5.6843e-07 34 15
RB 1300 5.3137e+00 4.7256e+00 1.5625e-02 149 3
CMPB 345 2.8696e-13 1.3357e-06 3.5527e-06 31 15
RMPB 381 5.5919e-11 8.3890e-06 2.2737e-06 34 14
2 CB 117 4.8984e+01 8.9198e-05 9.5367e-06 14 9
RB 102 4.8984e+01 2.6257e-06 3.8147e-06 12 9
CMPB 245 5.9689e-18 9.3936e-08 5.9605e-07 23 11
RMPB 165 4.8984e+01 8.6418e-06 5.9605e-06 16 10
3 CB 1300 1.0178e-08 4.8733e+00 6.2500e-10 86 44
RB 1300 1.1101e-08 2.2184e+00 1.5558e-09 102 49
CMPB 208 1.0101e-08 2.0101e-04 1.0000e-10 18 18
RMPB 1300 1.5221e-07 1.7535e+01 1.7937e-08 71 36
4 CB 188 2.8703e-24 3.3884e-06 9.0949e-08 20 15
RB 191 4.1550e-26 4.0788e-07 1.0000e-10 21 19
CMPB 1300 1.2157e+09 6.9664e+10 9.7656e+00 119 1
RMPB 274 7.5858e-24 5.5090e-06 2.2737e-10 21 18
5 CB 96 1.7741e-12 9.5287e-06 9.5367e-06 12 9
RB 131 7.6731e-14 7.3698e-07 1.4901e-06 16 11
CMPB 137 1.9443e-13 7.2973e-07 9.5367e-07 14 10
RMPB 135 1.3844e-12 1.0291e-06 9.5367e-06 13 9
6 CB 214 1.2436e+02 1.1450e-06 9.3132e-08 26 13
RB 179 1.2436e+02 6.9703e-04 2.3842e-06 21 10
CMPB 242 1.2436e+02 3.9619e-06 1.4901e-07 23 12
RMPB 198 1.2436e+02 4.9373e-05 5.9605e-08 20 12
7 CB 277 2.4478e-16 4.0593e-08 2.2737e-08 27 16
RB 358 1.3523e-12 5.2517e-06 3.6380e-06 33 13
CMPB 386 1.1919e-17 1.2976e-07 2.2204e-08 31 18
RMPB 404 1.3124e-13 6.2378e-07 5.6843e-08 32 16
8 CB 228 8.2149e-03 1.5169e-06 3.6380e-08 21 15
RB 226 8.2149e-03 1.2996e-06 1.4552e-07 21 14
CMPB 332 8.2149e-03 3.2610e-06 9.3132e-08 27 13
RMPB 325 8.2149e-03 5.2576e-06 5.6843e-08 25 16
9 CB 88 1.1279e-08 3.0817e-08 3.8147e-06 9 9
RB 88 1.1279e-08 2.6819e-08 3.8147e-06 9 9
CMPB 106 1.1279e-08 2.5277e-08 3.8147e-06 9 9
RMPB 106 1.1279e-08 2.2890e-07 3.8147e-06 9 9
10 CB 1300 2.6761e+04 1.4694e+06 1.7937e-03 95 31
RB 1300 4.7088e+06 1.4611e+08 8.6736e-05 114 16
CMPB 1300 8.7574e+06 3.3756e+08 5.5511e-02 99 12
RMPB 1300 6.6179e+06 2.0357e+08 8.6736e-04 95 15
11 CB 611 2.1173e-12 5.3498e-06 1.2622e-08 50 27
RB 586 8.6109e-10 9.3682e-06 8.0779e-09 48 26
CMPB 653 5.8997e-10 7.3782e-06 8.2718e-09 46 24
RMPB 1300 3.7178e-04 1.1058e-02 7.1746e-05 97 32
Table 7: Results of FB-PCG applied to problems 1–11 from Table 6.
Prob. grad type nf fmin \|g(x)\| h itns qmfs
12 CB 258 1.0126e-06 8.9612e-06 5.6843e-10 22 18
RB 132 7.8257e-17 1.1635e-08 5.9605e-06 12 10
CMPB 336 3.7634e-08 5.5972e-06 3.5527e-09 24 18
RMPB 170 8.7957e-13 9.6558e-08 1.4901e-06 13 11
13 CB 388 9.5087e-09 7.8340e-06 8.8818e-10 29 19
RB 680 1.2044e-11 3.4770e-06 8.2718e-10 53 25
CMPB 539 9.7590e-10 2.7194e-06 1.3878e-09 36 20
RMPB 732 1.2329e-09 2.2736e-06 5.2940e-09 49 23
14 CB 496 2.2343e-13 3.8564e-06 1.3878e-08 39 19
RB 664 1.0394e-12 9.6018e-06 1.3553e-08 52 21
CMPB 1175 7.0443e-21 3.2592e-09 3.1554e-10 79 29
RMPB 1200 2.6560e-16 4.1629e-07 3.0815e-09 80 30
15 CB 409 3.0751e-04 7.0794e-08 3.4694e-10 29 21
RB 730 3.0751e-04 8.1424e-07 3.0815e-09 54 30
CMPB 456 3.0751e-04 2.1624e-07 1.0000e-10 29 22
RMPB 879 3.0751e-04 2.9915e-06 1.9259e-10 56 32
16 CB 218 8.5822e+04 2.0390e-01 2.3842e-05 18 9
RB 256 8.5822e+04 5.2632e-02 5.9605e-06 21 10
CMPB 328 8.5822e+04 8.3039e-03 9.5367e-06 23 9
RMPB 269 8.5822e+04 1.3964e-02 5.9605e-06 19 10
17 CB 1300 6.1863e-05 7.8087e-02 7.5232e-06 86 29
RB 1300 9.2307e-02 6.5300e+01 1.5259e-03 86 6
CMPB 1300 3.3394e-03 6.3918e+00 9.5367e-04 77 7
RMPB 1300 5.3120e-02 1.9312e+01 9.5367e-04 78 7
18 CB 522 5.6556e-03 5.2532e-06 8.6736e-09 30 20
RB 1300 5.6557e-03 1.8064e-04 1.1210e-08 73 37
CMPB 712 5.6557e-03 3.1441e-06 2.1176e-09 36 23
RMPB 1300 5.6559e-03 8.2987e-04 7.3468e-08 65 33
19 CB 1300 4.0437e-02 4.3713e-02 1.3235e-03 48 18
RB 1300 2.8701e-01 2.5030e+00 6.2500e-02 49 2
CMPB 1300 4.6094e-02 2.3523e-01 5.2940e-03 45 17
RMPB 1300 2.8656e-01 3.4951e+00 6.2500e-02 46 2
20 CB 80 9.2331e-18 1.3956e-08 3.8147e-06 10 9
RB 78 1.0496e-17 1.4880e-08 3.8147e-06 10 9
CMPB 97 1.9978e-17 2.0529e-08 3.8147e-06 10 9
RMPB 99 8.0511e-18 1.3032e-08 3.8147e-06 10 9
21 CB 77 2.3695e-19 1.8024e-09 3.8147e-06 10 9
RB 77 1.5376e-16 6.2567e-08 3.8147e-06 10 9
CMPB 98 6.8671e-14 1.3603e-06 3.8147e-06 10 9
RMPB 96 7.0015e-15 4.3732e-07 3.8147e-06 10 9
22 CB 1155 7.4976e-13 7.8573e-06 1.0000e-10 26 19
RB 1244 7.6929e-13 8.5455e-06 1.0000e-10 28 18
CMPB 1207 2.9752e-13 5.3954e-06 1.0000e-10 26 19
RMPB 1113 4.3271e-13 8.0808e-06 1.0000e-10 24 18
Table 8: Results of FB-PCG applied to problems 12–22 from Table 6.

The experiments show that the FB-PCG algorithm performs broadly similarly regardless of which gradient and diagonal Hessian approximation is used, as expected, although on a handful of problems (for example, Problems 1 and 4) one particular basis exhausts the budget of 1300 function evaluations without locating the minimizer.

7 Conclusion

This work studied the use of interpolation models to obtain approximations to the gradient and diagonal of the Hessian of a function f:𝐑n𝐑f:\mathbf{R}^{n}\to\mathbf{R}. We showed that if the coordinate basis is used in the interpolation model, with the parameter η=1\eta=-1 used to generate the interpolation points, then one recovers the standard finite difference approximations to the derivative. We also showed that if a regular basis, a coordinate minimal positive basis, or a regular minimal positive basis are used within the interpolation model, then approximations g(x)g(x) to the gradient, and d(x)d(x) to the pure second derivatives are available in 𝒪(n)\mathcal{O}(n) flops. An error bound was presented which shows that for each set of interpolation directions considered in this work, the gradient is 𝒪(h2)\mathcal{O}(h^{2}) accurate. Numerical experiments were presented that show the practical uses of the estimates developed. In particular, we showed how they could be employed within the FB-PCG algorithm to solve (1).

References

  • [1] Charles Audet and Warren Hare. Derivative-free and blackbox optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2017.
  • [2] A. S. Bandeira, K. Scheinberg, and L. N. Vicente. Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization. Mathematical Programming Series B, 134:223–257, 2012.
  • [3] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018. Editor: Léon Bottou.
  • [4] A. S. Berahas, L. Cao, K. Choromanskiy, and K. Scheinberg. A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Technical report, Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA, USA, 2019. arXiv:1905.01332v2 [math.OC].
  • [5] G. Cocchi, G. Liuzzi, A. Papini, and M. Sciandrone. An implicit filtering algorithm for derivative-free multiobjective optimization with box constraints. Computational Optimization and Applications, 69(2):267–296, 2018.
  • [6] A. R. Conn, K. Scheinberg, and Luís N. Vicente. Geometry of interpolation sets in derivative free optimization. Mathematical Programming Series B, 111:141–172, 2008.
  • [7] A. R. Conn, K. Scheinberg, and Luís N. Vicente. Geometry of sample sets in derivative-free optimization: polynomial regression and underdetermined interpolation. IMA Journal of Numerical Analysis, 28:721–748, 2008.
  • [8] Andrew R. Conn and Philippe L. Toint. An algorithm using quadratic interpolation for unconstrained derivative free optimization. In G. Di Pillo and F. Giannessi, editors, Nonlinear Optimization and Applications, pages 27–47, Boston, MA, 1996. Springer US.
  • [9] A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization. MPS-SIAM Series on Optimization, Philadelphia, 2009.
  • [10] A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization. MPS-SIAM Series on Optimization, Philadelphia, 2009.
  • [11] A.R. Conn, K. Scheinberg, and P.L. Toint. Recent progress in unconstrained nonlinear optimization without derivatives. Mathematical Programming, 79:397–414, 1997.
  • [12] I. D. Coope and C. J. Price. A derivative-free frame-based conjugate gradients method. Ucdms2002-7, University of Canterbury, 2002. https://ir.canterbury.ac.nz/handle/10092/11758.
  • [13] I. D. Coope and R. Tappenden. Efficient calculation of regular simplex gradients. Computational Optimization and Applications, 72(3):561–588, April 2019.
  • [14] I.D. Coope and C.J. Price. Frame-based methods for unconstrained optimization. Journal of Optimization Theory & Applications, 107(2):261–274, 2000.
  • [15] I.D. Coope and C.J. Price. Positive bases in numerical optimization. Computational Optimization & Applications, 21:169–175, 2002.
  • [16] Giovanni Fasano, José Luis Morales, and Jorge Nocedal. On the geometry phase in model-based algorithms for derivative-free optimization. Optimization Methods and Software, 24(1):145–154, 2009.
  • [17] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. Proceedings of Machine Learning Research (PMLR), 80:1467–1476, 2018. International Conference on Machine Learning, 10-15 July 2018, Stockholmsmässan, Stockholm, Sweden.
  • [18] P. Gilmore, C. T. Kelley, C. T. Miller, and G. A. Williams. Implicit filtering and optimal design problems. In Jeffrey Borggaard, John Burkardt, Max Gunzburger, and Janet Peterson, editors, Optimal Design and Control, pages 159–176, Boston, MA, 1995. Birkhäuser Boston.
  • [19] P. Gilmore and C.T. Kelley. An implicit filtering algorithm for optimization of functions with many local minima. SIAM J. Optim., 5(2):269–285, 1995.
  • [20] W. Hare and M. Jaberipour. Adaptive interpolation strategies in derivative-free optimization: a case study. Technical report, University of British Colombia, Canada, and Amirkabir University of Technology, Iran, November 2015. arXiv:1511.02794v1 [math.OC].
  • [21] Philipp H. W. Hoffmann. A hitchhiker’s guide to automatic differentiation. Numerical Algorithms, 72(3):775–811, 2016.
  • [22] Gabriel Jarry-Bolduc, Patrick Nadeau, and Shambhavi Singh. Uniform simplex of an arbitrary orientation. Optimization Letters, 2019. Published online 03 July 2019, https://doi.org/10.1007/s11590-019-01448-3.
  • [23] Q. Liu. Two minimal positive basis based direct search conjugate gradient methods for computationally expensive functions. Numer Algor, 58:461–474, 2011.
  • [24] Alvaro Maggiar, Andreas Wächter, Irina S Dolinskaya, and Jeremy Staum. A derivative-free trust-region algorithm for the optimization of functions smoothed via gaussian convolution using adaptive multiple importance sampling. SIAM Journal on Optimization, 28(2):1478–1507, 2018.
  • [25] C. C. Margossian. A review of automatic differentiation and its efficient implementation. Technical report, Department of Statistics, Columbia University, January 2019. arXiv:1811.05031v2 [cs.MS].
  • [26] J. J. Moré and S. Wild. Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization, 20(1):172–191, 2009.
  • [27] J. J. Moré, B. S. Garbow, and K. E. Hillstrom. Testing unconstrained optimization software. ACM Trans. Math. Softw., 7(1):17–41, 1981.
  • [28] J.A. Nelder and R. Mead. A simplex method for function minimization. The computer journal, 7(4):308–313, 1965.
  • [29] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
  • [30] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 2 edition, 2006.
  • [31] Luis Miguel Rios and Nikolaos V. Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations. J Glob Optim, 56:1247–1293, 2013.
  • [32] W. Spendley, G.R. Hext, and F.R. Himsworth. Sequential application of simplex designs in optimisation and evolutionary operation. Technometrics, 4:441–461, 1962.
  • [33] Stefan M. Wild and Christine Shoemaker. Global convergence of radial basis function trust-region algorithms for derivative-free optimization. SIAM Review, 55(2):349–371, 2013.

Appendix A Constants

In this section we state several identities relating the constants that are used throughout this work.

γ1nγ=γn+1=1n(n+11),\displaystyle\tfrac{\gamma}{1-n\gamma}=\gamma\sqrt{n+1}=\tfrac{1}{n}(\sqrt{n+1}-1), (85)
γ2=(31)1n2(11n+1)2=1n2(1n+1+12n+1)=1n2(n+2n+12n+1)\displaystyle\gamma^{2}\overset{\eqref{eq:alphagamma}}{=}\tfrac{1}{n^{2}}\left(1-\tfrac{1}{\sqrt{n+1}}\right)^{2}=\tfrac{1}{n^{2}}\left(\tfrac{1}{n+1}+1-\tfrac{2}{\sqrt{n+1}}\right)=\tfrac{1}{n^{2}}\left(\tfrac{n+2}{n+1}-\tfrac{2}{\sqrt{n+1}}\right) (86)
α2γ2=(31)+(86)n+1n3(n+2n+12n+1)\displaystyle\alpha^{2}\gamma^{2}\overset{\eqref{eq:alphagamma}+\eqref{eq:gammasquared}}{=}\tfrac{n+1}{n^{3}}\left(\tfrac{n+2}{n+1}-\tfrac{2}{\sqrt{n+1}}\right) =\displaystyle= 1n3(n+22n+1)\displaystyle\tfrac{1}{n^{3}}(n+2-2\sqrt{n+1}) (87)
=\displaystyle= 1n3(n+11)2.\displaystyle\tfrac{1}{n^{3}}(\sqrt{n+1}-1)^{2}. (88)
12γ\displaystyle 1-2\gamma =\displaystyle= (12n+2nn+1)=1n(n2+2n+1)=(n2)n+1+2nn+1.\displaystyle\left(1-\tfrac{2}{n}+\tfrac{2}{n\sqrt{n+1}}\right)=\tfrac{1}{n}\left(n-2+\tfrac{2}{\sqrt{n+1}}\right)=\tfrac{(n-2)\sqrt{n+1}+2}{n\sqrt{n+1}}. (89)
μ=α2(12γ)\displaystyle\mu=\alpha^{2}(1-2\gamma) =(89)\displaystyle\overset{\eqref{eq:oneminustwogamma}}{=} n+1n((n2)n+1+2nn+1)=1n2((n2)(n+1)+2n+1).\displaystyle\tfrac{n+1}{n}\left(\tfrac{(n-2)\sqrt{n+1}+2}{n\sqrt{n+1}}\right)=\tfrac{1}{n^{2}}\left((n-2)(n+1)+2\sqrt{n+1}\right). (90)
ωn=nγ212γ=(89)+(86)(n+1(n2)n+1+2)(n+22n+1n+1)=(n+22n+1(n2)(n+1)+2n+1)\displaystyle\omega n=\tfrac{n\gamma^{2}}{1-2\gamma}\overset{\eqref{eq:oneminustwogamma}+\eqref{eq:gammasquared}}{=}\left(\tfrac{\sqrt{n+1}}{(n-2)\sqrt{n+1}+2}\right)\left(\tfrac{n+2-2\sqrt{n+1}}{n+1}\right)=\left(\tfrac{n+2-2\sqrt{n+1}}{(n-2)(n+1)+2\sqrt{n+1}}\right) (91)
1+ωn=(91)(n2)(n+1)+2n+1(n2)(n+1)+2n+1+(n+22n+1(n2)(n+1)+2n+1)=n2(n2)(n+1)+2n+1\displaystyle 1+\omega n\overset{\eqref{eq:omegan}}{=}\tfrac{(n-2)(n+1)+2\sqrt{n+1}}{(n-2)(n+1)+2\sqrt{n+1}}+\left(\tfrac{n+2-2\sqrt{n+1}}{(n-2)(n+1)+2\sqrt{n+1}}\right)=\tfrac{n^{2}}{(n-2)(n+1)+2\sqrt{n+1}} (92)
ωn1+ωn=(91)+(92)1n2(n+22n+1)\displaystyle\tfrac{\omega n}{1+\omega n}\overset{\eqref{eq:omegan}+\eqref{eq:oneplusomegan}}{=}\tfrac{1}{n^{2}}\left(n+2-2\sqrt{n+1}\right) (93)
1μ=(90)1n2(n2((n2)(n+1)+2n+1))=(93)ωn1+ωn\displaystyle 1-\mu\overset{\eqref{eq:muasn}}{=}\tfrac{1}{n^{2}}\left(n^{2}-\left((n-2)(n+1)+2\sqrt{n+1}\right)\right)\overset{\eqref{eq:omeganoveroneplusomegan}}{=}\tfrac{\omega n}{1+\omega n} (94)
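These identities are straightforward to check numerically. The following MATLAB sketch (assuming \alpha=\sqrt{(n+1)/n} and \gamma=(1-1/\sqrt{n+1})/n, consistent with the n=2 values used in Section 6.1) verifies (85), (90), (91) and (94) for a range of n.

% Numerical check of selected identities from Appendix A, assuming
% alpha = sqrt((n+1)/n) and gamma = (1 - 1/sqrt(n+1))/n.
for n = 2:10
    alpha = sqrt((n+1)/n);
    gamma = (1 - 1/sqrt(n+1))/n;
    lhs85 = gamma/(1 - n*gamma);                        % left-hand side of (85)
    rhs85 = (sqrt(n+1) - 1)/n;                          % right-hand side of (85)
    mu    = alpha^2*(1 - 2*gamma);                      % mu as in (90)
    rhs90 = ((n-2)*(n+1) + 2*sqrt(n+1))/n^2;
    wn    = n*gamma^2/(1 - 2*gamma);                    % omega*n as in (91)
    assert(abs(lhs85 - gamma*sqrt(n+1)) < 1e-12);       % (85), first equality
    assert(abs(lhs85 - rhs85) < 1e-12);                 % (85), second equality
    assert(abs(mu - rhs90) < 1e-12);                    % (90)
    assert(abs((1 - mu) - wn/(1 + wn)) < 1e-12);        % (94)
end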