Gradient and Hessian approximations in Derivative Free Optimization
Abstract
This work investigates finite differences and the use of interpolation models to obtain approximations to the first and second derivatives of a function. Here, it is shown that if a particular set of points is used in the interpolation model, then the solution to the associated linear system (i.e., approximations to the gradient and diagonal of the Hessian) can be obtained in $\mathcal{O}(n)$ computations, which is the same cost as finite differences, and is a saving over the $\mathcal{O}(n^3)$ cost of solving a general unstructured linear system. Moreover, if the interpolation points are formed using a ‘regular minimal positive basis’, then the error bound for the gradient approximation is the same as for a finite differences approximation. Numerical experiments are presented that show how the derivative estimates can be employed within an existing derivative free optimization algorithm, thus demonstrating one of the potential practical uses of these derivative approximations.
Keywords. Derivative Free Optimization; Positive Bases; Finite difference approximations; Interpolation models; Taylor series; Conjugate Gradients; Simplices; Simplex Gradients; Preconditioning; Frames.
1 Introduction
This work considers unconstrained optimization problems of the form
(1) $\min_{x \in \mathbb{R}^n} f(x),$
where $f : \mathbb{R}^n \to \mathbb{R}$ is smooth, but the derivatives of $f$ are unavailable. This is a typical derivative free optimization setting. Problems of this form arise, for example, when the function is smooth but it is computationally impractical to obtain derivative information, or when the function itself is not known explicitly although function values are available from a black-box or oracle (for example, via physical measurements).
A key ingredient of many algorithms for solving (1) in a derivative free setting is an estimate of the gradient of $f$. Sometimes gradients are used to determine search directions (for example, in implicit filtering algorithms [5, 18, 19]), and sometimes they are used in the definition of an algorithm’s stopping criteria. Several techniques exist for estimating derivatives, including automatic differentiation [21, 3, 25], finite difference approximations, interpolation models [8, 11], Gaussian smoothing [24, 29], and smoothing on the unit sphere [17]. This work investigates finite differences and interpolation models, so henceforth, we focus on these approaches.
A well known, simple, and widely applicable approach to estimating derivatives is finite difference approximations (or finite differences). Finite differences are so called because they provide estimates to derivatives by taking the difference between function values at certain known points. For example, for a function $f : \mathbb{R} \to \mathbb{R}$, the forward differences approximation to the (first) derivative is
(2) $f'(x) \approx \dfrac{f(x+h) - f(x)}{h},$
which is an $\mathcal{O}(h)$ accurate approximation. Many other (well known) formulations exist, including approximations that provide higher-accuracy estimates of the gradient, as well as formulations for higher order derivatives. The standard finite difference formulations can be used whenever function values evaluated along coordinate directions are available. The formulations involve minimal computations ($\mathcal{O}(n)$ operations), and they have low memory requirements because one does not need to store the coordinate directions, but finite differences can suffer from cancellation errors.
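To make this concrete, the following short Python/numpy sketch applies the forward-difference idea componentwise along the coordinate directions to estimate a gradient; the function name, test function, and step size are illustrative choices and are not taken from this paper.

```python
import numpy as np

def forward_difference_gradient(f, x, h=1e-6):
    """O(h)-accurate gradient estimate using n extra function evaluations,
    one along each coordinate direction e_1, ..., e_n."""
    n = x.size
    f0 = f(x)
    g = np.empty(n)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        g[i] = (f(x + h * e_i) - f0) / h
    return g

# Example: f(x) = x_1^2 + 3 x_2 at (1, 2); the true gradient is (2, 3).
f = lambda x: x[0]**2 + 3.0 * x[1]
print(forward_difference_gradient(f, np.array([1.0, 2.0])))
```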
Another approach to approximating derivatives is to use interpolation models [9, 33, 20]. Interpolation models can be viewed as a generalization of finite differences, where, instead of being restricted to the coordinate directions, function evaluations along a general set of directions are utilized, and derivative approximations are found by solving an associated system of equations (i.e., minimizing the interpolation model). Linear models lead to approximations to the gradient [4], while quadratic models provide approximations to the gradient and Hessian [8]. Much research has focused on the ‘quality’ of the set of directions, because this impacts the accuracy of the derivative estimates [2, 6, 7, 16].
The purpose of this work is to show that, if one uses a (diagonal) quadratic interpolation model with carefully chosen interpolation directions and function values, then approximations to the gradient and diagonal of the Hessian can be obtained in $\mathcal{O}(n)$ computations and with only a few vectors of storage. The solution to this specific instance of an interpolation model can be thought of, in some sense, as a general finite difference type approximation to the gradient and diagonal of the Hessian, because the solution to the interpolation model is a formula that involves sums/differences of function values at the interpolation points. Although the focus of this work is on derivative estimates and not algorithmic development, the estimates here are widely applicable, and could be employed by many algorithms that require gradient approximations. Moreover, the diagonal Hessian approximation is readily available, and could be used as a preconditioner within an algorithm. (See the numerical experiments in Section 6, which include a derivative free preconditioned conjugate gradients algorithm.)
It remains to note that the approaches mentioned above all require access to function values, and the cost of evaluating a function in a derivative free optimization setting can vary greatly depending upon the particular application being considered. It is up to the user to determine their function evaluation budget; the approaches in this work require $\mathcal{O}(n)$ function evaluations to generate derivative approximations.
1.1 Summary of Contributions
The contributions of this work are listed now.
1. Derivative estimates. This work builds a quadratic interpolation model (with a diagonal approximation to the Hessian), and shows that if a certain set of interpolation directions is used in the quadratic model, with associated function evaluations, then approximations to the gradient and diagonal of the Hessian can be found in $\mathcal{O}(n)$ operations. The set of interpolation directions is referred to as a regular minimal positive basis, and these directions need not be stored in full, but can be computed on-the-fly from a small number of stored vectors.
2. An error bound. We confirm that the gradient approximations developed in this work are $\mathcal{O}(h^2)$ accurate, where $h$ is the ‘sampling radius’. (See Section 4.5.)
3. Application. We provide an example of how the gradient estimates developed in this work could be used in practice. In particular, we take the frame based, preconditioned conjugate gradients algorithm developed in [12], which requires estimates of the gradient and a diagonal preconditioner, and employ the gradient and diagonal Hessian estimates developed in this work. (The original paper [12] employed the standard central difference approximations to the gradient and diagonal of the Hessian.)
1.2 Notation
Let $\mathbf{1}$ denote the vector of all ones and let $e_i$ for $i = 1, \dots, n$ denote the standard unit basis vectors in $\mathbb{R}^n$, i.e., $e_i$ is the $i$th column of the identity matrix $I$. (Usually the dimension will be $n$ or $n+1$ in this work.)
Consider the points . Define for , i.e., denotes the function value at the point . It is convenient to define the following vectors, which contain the function values:
(3) |
Similarly, for the points , define for all , (i.e., denotes the function value at the point ), and
(4) |
It will also be convenient to define the vectors
(5) |
and
(6) |
Usually, vectors are assumed to lie in $\mathbb{R}^n$, while the subscript ‘$+$’ denotes that the vector has an additional entry, so such a vector is an element of $\mathbb{R}^{n+1}$ (recall the notation above). Matrices are assumed to be elements of $\mathbb{R}^{n \times n}$, while the subscript ‘$+$’ denotes an additional column, so that such a matrix is an element of $\mathbb{R}^{n \times (n+1)}$.
The symbol ‘$\odot$’ denotes the Hadamard (componentwise) product, i.e., given $a, b \in \mathbb{R}^n$, $a \odot b \in \mathbb{R}^n$ is the vector with components $(a \odot b)_i = a_i b_i$.
1.3 Paper Outline
This paper is organised as follows. Section 2 introduces the (diagonal) quadratic interpolation model employed in this work, and shows how derivative approximations can be found using the model. Section 3 describes how approximations to the gradient and the diagonal of the Hessian can be constructed from the model in $\mathcal{O}(n)$ computations when the interpolation directions are chosen to be the coordinate basis, and also when they are chosen to be a ‘regular basis’ (see Section 3.2). Many algorithms in derivative free optimization are based upon simplices, which can be generated from a minimal positive basis. Thus, in Section 4, approximations to the gradient and the diagonal of the Hessian that can be computed in $\mathcal{O}(n)$ operations are presented when the interpolation points are formed from either a coordinate minimal positive basis, or a regular minimal positive basis (see Section 4.2). It will also be shown that the regular basis and the regular minimal positive basis are closely related to a regular simplex, hence the terminology ‘regular’ (see Section 4.1). In Section 4.5 an error bound is presented to confirm that the gradient approximations developed in this work are $\mathcal{O}(h^2)$ accurate, where $h$ is the sampling radius. Section 5 summarizes existing work on linear models, while numerical experiments are presented in Section 6 to demonstrate the accuracy and applicability of these results.
2 Derivative Estimation via Interpolation Models
This section describes how to approximate derivatives of a function via interpolation models. The Taylor series for a function $f$ about a point $x_0$ is
(7) $f(x) \approx f(x_0) + (x - x_0)^T \nabla f(x_0) + \tfrac{1}{2} (x - x_0)^T \nabla^2 f(x_0) (x - x_0).$
Let $g$ denote an approximation to the gradient (i.e., $g \approx \nabla f(x_0)$), and let $H$ denote an approximation to the Hessian (i.e., $H \approx \nabla^2 f(x_0)$). Then (7) becomes
(8) $f(x) \approx f(x_0) + (x - x_0)^T g + \tfrac{1}{2} (x - x_0)^T H (x - x_0).$
In this work, only a diagonal approximation to the Hessian is considered, so define
(9) $D = \operatorname{diag}(d_1, \dots, d_n),$
where $d = (d_1, \dots, d_n)^T \in \mathbb{R}^n$. Combining (8) and (9) leads to the (diagonal) quadratic model
(10) $m(x) = f(x_0) + (x - x_0)^T g + \tfrac{1}{2} (x - x_0)^T D (x - x_0).$
The motivation for this diagonal quadratic model is as follows. It is well known that, if one uses a linear model with $n$ (or $n+1$) interpolation points (i.e., $n$ or $n+1$ function evaluations), then an estimate of the gradient is available that is $\mathcal{O}(h)$ accurate (see also Section 5, which summarises results for linear models). For some situations, an $\mathcal{O}(h)$ accurate gradient is not of sufficient accuracy to make algorithmic progress, so an $\mathcal{O}(h^2)$ accurate gradient is used instead. In this case, one can still use a linear model, but roughly twice as many interpolation points (i.e., function evaluations) are needed. In derivative free optimization, function evaluations are often expensive to obtain, so it is prudent to ‘squeeze out’ as much information from them as possible. Hence, suppose one had this larger number of function evaluations available because an $\mathcal{O}(h^2)$ accurate gradient was desired. The model (10) contains $2n$ unknowns, so $2n$ interpolation points and function values are enough to uniquely determine an $\mathcal{O}(h^2)$ accurate gradient $g$, as well as an approximation to the pure second derivatives on the diagonal of $D$, i.e., no additional function evaluations are required to compute the pure second derivatives. Although $D$ can be determined ‘for free’ in terms of function evaluations (assuming these function values are already available for the gradient estimate), we will also show that the diagonal matrix $D$ can be computed in $\mathcal{O}(n)$ flops, so the linear algebra costs are low. This is similar to the case for finite differences (i.e., one requires $2n$ function evaluations to compute $g$ via the central differences formula, and those same function values can also be used to compute the pure second derivatives), and so this diagonal quadratic model (10) is useful because one can obtain analogous results for a certain set of interpolation points.
The goal is to determine $g$ and $D$ from the interpolation model (10) using a set of known points and the corresponding function values. Hence, one must build a sample set (a set of interpolation points). In general, one can use any number of interpolation points, although this affects the model. For a linear model (omitting the third term in (10)), one can determine a unique solution if there are $n$ interpolation/sample points with known function values at those points (see Section 5). For a general quadratic model with a full Hessian approximation, one requires $\mathcal{O}(n^2)$ function values to determine unique solutions for the gradient and Hessian approximations. If one has access to more or fewer sample points/function values, then one can use, for example, a least squares approach to find approximations to $g$ and $D$. In the rest of this section, $2n$ interpolation points will be used, and it will become clear that this allows unique solutions $g$ and $D$ to the model (10).
Given , a set of directions , and a scalar (sometimes referred to as the sampling radius), a set of points is defined via the equations
(11) |
The model is constructed to satisfy the interpolation conditions
(12) |
Substituting into the model (10) gives the equations
(13) |
Because $D$ is diagonal, the quadratic term satisfies $u^T D u = (u \odot u)^T d$ for any $u \in \mathbb{R}^n$. Defining
(14) | |||||
(15) |
where , allows (13) to be written in matrix form as
(16) |
The system (16) is underdetermined (there are only $n$ equations in $2n$ unknowns), and as the goal is to find unique solutions $g$ and $D$, an additional $n$ equations are required. Define
(17) |
Now, keep the point $x_0$ and the set of directions fixed, and choose a scalar $\sigma$ satisfying (17). A new set of points is defined via the equations
(18) |
The definition of $\sigma$ in (17) ensures that the new points do not coincide with the points already generated. The model is constructed to satisfy the interpolation conditions
(19) |
(That is, and for .) Following similar arguments to those previously established, one arrives at the system
(20) | |||||
The systems (16) and (20) can be combined into block matrix form as follows:
(21) |
It is clear that (21) represents a system of $2n$ equations in $2n$ unknowns. Performing block row operations on the system (21) (i.e., multiplying the first block row by $\sigma^2$ and subtracting the second block row, and also multiplying the first block row by $\sigma$ and subtracting the second block row) gives
(22) |
Hence, estimates of the gradient and diagonal of the Hessian can be found by solving
(23) |
where
(24) | |||||
(25) |
Questions naturally arise, such as ‘when do unique solutions to (23) exist?’ and ‘what is the cost of computing the solutions?’. Clearly, if the coefficient matrices in (23) have full rank, then unique solutions to (23) exist. Their rank, of course, depends upon how the interpolation points are chosen. If the interpolation directions form a basis for $\mathbb{R}^n$, then a unique solution $g$ exists. Similarly, if the Hadamard squares of the interpolation directions form a basis for $\mathbb{R}^n$, then a unique solution $D$ exists.
Computing the solution to a general unstructured $n \times n$ system has a cost of $\mathcal{O}(n^3)$ flops. The focus of the remainder of this paper is to show that, if one selects the interpolation points/interpolation directions in a specific way, then unique solutions $g$ and $D$ to (23) exist, and moreover, the computational cost of finding the solutions remains low, at $\mathcal{O}(n)$. This is the same cost as finite difference approximations to the gradient and pure second order derivatives. Importantly, it will also be shown that this specific set of directions does not increase the memory footprint either, requiring only a few vectors of storage.
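As a concrete point of reference for this cost comparison, the sketch below assembles the combined interpolation system for the diagonal quadratic model (10) with an arbitrary set of directions and solves it with a generic dense solver; this is precisely the $\mathcal{O}(n^3)$ route that the structured bases studied in the rest of the paper avoid. The names (`U` for the matrix of interpolation directions, `sigma` for the scaling of the second point set) are illustrative.

```python
import numpy as np

def interpolation_g_and_D(f, x0, U, h, sigma=-1.0):
    """Build the 2n x 2n block interpolation system for the diagonal quadratic
    model and solve it with a dense solver (O(n^3) linear algebra).
    U holds the n interpolation directions as columns; the second set of
    points reuses the same directions scaled by sigma (sigma != 0, 1)."""
    n = x0.size
    f0 = f(x0)
    dfy = np.array([f(x0 + h * U[:, i]) - f0 for i in range(n)])
    dfz = np.array([f(x0 + sigma * h * U[:, i]) - f0 for i in range(n)])
    A = h * U.T                   # coefficients of g in each interpolation condition
    B = 0.5 * h**2 * (U * U).T    # coefficients of diag(D): Hadamard squares of the directions
    M = np.block([[A, B], [sigma * A, sigma**2 * B]])
    sol = np.linalg.solve(M, np.concatenate([dfy, dfz]))
    return sol[:n], np.diag(sol[n:])

# For a quadratic test function the recovery is exact (up to rounding):
f = lambda x: x[0]**2 + 2.0 * x[1]**2 + x[0] * x[1]
g, D = interpolation_g_and_D(f, np.array([1.0, -1.0]), np.eye(2), h=1e-3)
print(g, np.diag(D))    # approximately (1, -3) and (2, 4)
```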
In what follows, it will be useful to notice that, in the special case $\sigma = -1$, (24) and (25) become
(26) | |||||
(27) |
The Sherman-Morrison formula, for a nonsingular matrix $A \in \mathbb{R}^{n \times n}$ and vectors $u, v \in \mathbb{R}^n$, is used many times in this work, and is stated now for convenience:
(28) $(A + uv^T)^{-1} = A^{-1} - \dfrac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}, \qquad \text{provided } 1 + v^T A^{-1} u \neq 0.$
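A quick numerical sanity check of (28), with randomly generated data chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # nonsingular with high probability
u, v = rng.standard_normal(n), rng.standard_normal(n)
A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = A_inv - np.outer(A_inv @ u, v @ A_inv) / (1.0 + v @ A_inv @ u)
print(np.allclose(lhs, rhs))    # True
```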
3 Computing derivative estimates via interpolation
This section describes how to compute solutions to the diagonal quadratic interpolation model when the coordinate basis is used to construct the interpolation points, and also when the regular basis is employed. In both cases the computational cost of determining the solution is $\mathcal{O}(n)$, and when the coordinate basis is used with $\sigma = -1$ in (17), the usual finite differences approximations to the first and second (pure) derivatives are recovered.
3.1 Interpolation using the coordinate basis
Here we present the solutions $g$ and $D$ to (23) when the coordinate basis is used to generate the interpolation points. Given a point $x_0$, take the interpolation directions to be the coordinate vectors $e_1, \dots, e_n$, so that the interpolation points are $x_0 + h e_i$ and $x_0 + \sigma h e_i$ for $i = 1, \dots, n$. Substituting these points into (23) shows that the coefficient matrices are diagonal. Now, set $\sigma = -1$. Using (26) in the expression for $g$ shows that the $i$th element of $g$ is
(29) $(g)_i = \dfrac{f(x_0 + h e_i) - f(x_0 - h e_i)}{2h}, \qquad i = 1, \dots, n.$
Notice that (29) is the standard central differences approximation to the gradient, which is known to be $\mathcal{O}(h^2)$ accurate (see also Section 4.5).
Similarly, substituting (27) into the expression for $D$ shows that the $i$th diagonal element of $D$ is
(30) $(D)_{ii} = \dfrac{f(x_0 + h e_i) - 2 f(x_0) + f(x_0 - h e_i)}{h^2}, \qquad i = 1, \dots, n,$
which is the standard central differences approximation to the diagonal entries of the Hessian.
This confirms that when the coordinate basis is used in the diagonal quadratic interpolation model, with $\sigma = -1$, then (23) recovers the central finite differences formulas for the gradient and pure second order derivatives.
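For reference, (29) and (30) can be evaluated directly as follows; this is a plain central-difference routine using $2n + 1$ function evaluations, and the names are illustrative.

```python
import numpy as np

def central_difference_g_and_D(f, x0, h=1e-3):
    """Central-difference estimates of the gradient (29) and of the diagonal
    of the Hessian (30); O(n) arithmetic on top of 2n + 1 function values."""
    n = x0.size
    f0 = f(x0)
    g = np.empty(n)
    d = np.empty(n)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        f_plus, f_minus = f(x0 + h * e_i), f(x0 - h * e_i)
        g[i] = (f_plus - f_minus) / (2.0 * h)
        d[i] = (f_plus - 2.0 * f0 + f_minus) / h**2
    return g, np.diag(d)
```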
3.2 Interpolation using a regular basis
This section shows how gradient and diagonal Hessian estimates can be computed in $\mathcal{O}(n)$ flops if a specific basis is used to generate the interpolation points. For ease of reference, we call this a regular basis. (It will be shown that the directions forming the regular basis are the internal arms of a regular simplex; see Section 4.)
Define the scalars
(31) |
Let be the regular basis for , whose elements are defined by
(32) |
where the basis vectors can be collected as the columns of the matrix
(33) |
Equation (32) shows that the vectors are simply a multiple of $\mathbf{1}$ (the vector of all ones) with an adjustment to the $j$th component. Thus, these vectors need not be stored explicitly; they can simply be generated on-the-fly and then discarded. Therefore, the regular basis has a low storage footprint.
In [13, p.569] (see also page 20 and Corollary 2.6 in [10]) it is shown that, for the matrix defined in (33),
which shows that the elements of the regular basis have unit length:
(34) |
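The following sketch uses one standard construction with the properties described above (each direction is a multiple of the all-ones vector with one adjusted component, and every direction has unit length); the constants `a` and `b` below are an assumption and may be normalised or signed differently from the scalars in (31).

```python
import numpy as np

def regular_basis(n):
    """An assumed 'regular basis': n unit vectors of the form a*e_j + b*1 whose
    pairwise inner products all equal -1/n, so that appending -(1/sqrt(n)) * 1
    yields n + 1 unit directions with equal angles between every pair."""
    a = np.sqrt((n + 1.0) / n)
    b = -(np.sqrt(n + 1.0) - 1.0) / n**1.5
    return a * np.eye(n) + b * np.ones((n, n))

V = regular_basis(4)
print(np.linalg.norm(V, axis=0))    # every column has unit length
print(np.round(V.T @ V, 6))         # off-diagonal entries equal -1/n
```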
The following lemma confirms that the columns of the matrix in (33) form a basis for $\mathbb{R}^n$, and also establishes positive definiteness of the associated Gram matrix.
Lemma 1 (Lemma 2 in [13]).
Corollary 2.
Let the conditions of Lemma 1 hold. Then and .
Now, a set of interpolation points is generated using the regular basis. Fix $x_0$, $h$ and $\sigma$. Take the directions as in (33), and define the interpolation points via
(35) |
Note that, because the regular basis directions all have unit length by (34), the first set of interpolation points all lie on a circle of radius $h$ centred at $x_0$, and the second set all lie on a circle of radius $|\sigma| h$. Figure 1 provides an illustration showing the coordinate basis and the regular basis in $\mathbb{R}^2$, and how they can be used to generate the interpolation points for a particular choice of $\sigma$.
We are now ready to state the main results of this section, which show that estimates of $g$ and $D$ can be obtained in $\mathcal{O}(n)$ computations when using the model (13) and a regular basis (33).
Theorem 3.
Proof.
Corollary 4.
Theorem 3 and Corollary 4 show that using a regular basis in the quadratic interpolation model leads to an estimate for $g$ that is similar to the central finite differences approximation in (29). The formula (29) uses the factor $1/(2h)$, while the formula (39) uses a slightly different factor. Both (29) and (39) involve the difference between the function values measured along the positive and negative basis directions. However, (39) also contains a ‘correction’ term, which involves the function values measured at all the vertices. Lastly, note that Theorem 3 requires knowledge of $f(x_0)$, but in the special case $\sigma = -1$, Corollary 4 shows that $f(x_0)$ is not used.
Before presenting the next theorem, the following parameters are defined. Let
(40) |
Remark 5.
Note that is monotonically increasing in . Moreover, , and when , , so for any .
Theorem 6.
Proof.
Corollary 7.
Theorem 6 and Corollary 7 show that using a regular basis in the quadratic interpolation model leads to an estimate for $D$ that is similar to the central finite differences approximation in (30). The formula (30) uses the factor $1/h^2$, whereas (45) uses a slightly different factor. However, (45) also contains a ‘correction’ term, which is a kind of weighted average of the function values measured at the interpolation points.
4 Interpolation using minimal positive bases
Spendley et al. [32] and Nelder and Mead [28] carried out some of the pioneering work on derivative-free simplex-based algorithms. This work continues today with studies on properties of simplices [1, 22] as well as work on simplex algorithm development [31]. These, and many other algorithms in derivative free optimization, employ positive bases and simplices, and involve geometric operations such as expansions, rotations and contractions. In such cases, the original simplex/positive basis involves $n+1$ (or possibly more) points (with their corresponding function values), and if a geometric operation is performed, then this often doubles the number of interpolation points and corresponding function values that are available. As previously mentioned, function values are expensive, so it is prudent to make full use of them.
This section is motivated, in part, by the ongoing research on simplex based methods, and the following types of ideas. Consider a simplex based algorithm, where one arrives at an iteration that involves an expansion step. In such a case, one would have access to the function values measured at the vertices of both the original and the expanded simplices as a by-product. If an approximation to the gradient or diagonal of the Hessian was readily available, then this might be useful curvature information that could be built into a simplex based algorithm to help locate the solution more quickly. Moreover, depending on the type of geometric operation involved, the vertices of the two simplices are generated from the same set of base directions, one set scaled by $h$, and one scaled by $\sigma h$ for some $\sigma$. Therefore, in this work we have not restricted to $\sigma = -1$, to ensure the results are applicable to expansions and contractions, and not just rotations.
In this section we investigate the use of minimal positive bases (and their associated simplices) to generate interpolation points that can be used in the model (13). We will show that the results from the previous section carry over, in a natural way, when the interpolation points are generated from $n+1$ (rather than $n$) directions.
4.1 Positive Bases and Simplices
Definition 8 (Positive Basis).
A set of vectors is called a positive basis for $\mathbb{R}^n$ if and only if the following two conditions hold.
(i) Every vector in $\mathbb{R}^n$ is a nonnegative linear combination of the members of the set.
(ii) No proper subset of the set satisfies (i).
Definition 9 (Minimal and Maximal Positive Bases).
A positive basis with $n+1$ elements is called a minimal positive basis. A positive basis with $2n$ elements is called a maximal positive basis.
Lemma 10 (Minimal Positive Basis).
Let the columns of a nonsingular matrix in $\mathbb{R}^{n \times n}$ be given. Then these columns, together with the negative of their sum, form a minimal positive basis for $\mathbb{R}^n$.
Lemma 10 shows that the set $\{e_1, \dots, e_n, -\mathbf{1}\}$ is a minimal positive basis for $\mathbb{R}^n$. Lemma 10 also shows that the columns of
(46) |
where the first $n$ columns are given by the matrix defined in (33), form a minimal positive basis for $\mathbb{R}^n$ (recall that this matrix is nonsingular by Lemma 1). It can be shown that (see also Lemma 3 in [13])
(47) |
The minimal positive basis has many useful properties, and it is shown in [13] that the angle between any pair of elements of the minimal positive basis is the same, while (47) shows that one of the directions is aligned with the vector $\mathbf{1}$. Henceforth, we refer to the columns of the matrix in (46) as a regular minimal positive basis.
Definition 11 (Affine independence).
A set of points is called affinely independent if the vectors formed by subtracting one (arbitrarily chosen) point of the set from each of the others are linearly independent.
Definition 12 (Simplex).
The convex hull of a set of affinely independent points is called a simplex; a simplex formed from $m+1$ affinely independent points has dimension $m$.
Definition 13 (Regular Simplex).
A regular simplex is a simplex that is also a regular polytope.
Given a point $x_0$ and the regular minimal positive basis, the two sets of points formed via
(48) |
are each affinely independent. Hence, the convex hull of the first set is a regular simplex aligned in the direction $\mathbf{1}$, and so too is the convex hull of the second set. This explains why we have adopted the terminology ‘regular’, and why we refer to the columns of the matrix in (33) as a regular basis, and the columns of the matrix in (46) as a regular minimal positive basis; both are related to a regular simplex (see Section 3.1 in [13]).
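Continuing with the assumed construction from the earlier sketch, the following check illustrates the two properties used here: appending a direction proportional to the all-ones vector gives $n+1$ unit directions with equal pairwise angles, and the corresponding interpolation points are the vertices of a regular simplex (all pairwise distances equal). The constants remain an assumption.

```python
import numpy as np

n = 3
a = np.sqrt((n + 1.0) / n)
b = -(np.sqrt(n + 1.0) - 1.0) / n**1.5
V = a * np.eye(n) + b * np.ones((n, n))                    # assumed regular basis
V_plus = np.hstack([V, -np.ones((n, 1)) / np.sqrt(n)])     # assumed regular minimal positive basis

print(np.round(V_plus.T @ V_plus, 6))   # unit diagonal, every off-diagonal entry equals -1/n

# The points x0 + h * v_j (j = 1, ..., n+1) are mutually equidistant:
x0, h = np.zeros(n), 0.1
X = x0[:, None] + h * V_plus
dists = [np.linalg.norm(X[:, i] - X[:, j])
         for i in range(n + 1) for j in range(i + 1, n + 1)]
print(np.round(dists, 12))              # all pairwise distances are equal
```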
4.2 Interpolation using a minimal positive basis
In this section we show that estimates of the gradient and diagonal Hessian can be obtained in $\mathcal{O}(n)$ computations when a regular minimal positive basis is used to generate the interpolation points. Hence, let the interpolation directions form a basis for $\mathbb{R}^n$, so that the associated direction matrix is nonsingular, and append the negative sum of these directions so that, by Lemma 10, the resulting $n+1$ directions form a minimal positive basis for $\mathbb{R}^n$. Let the function values be as in (15) and define
(49) |
Following the same arguments as those presented in Section 2, estimates to the gradient and diagonal of the Hessian can be found by solving
(50) |
where
(51) |
and
(52) |
Clearly, the matrices in (50) are not square, so $g$ and $D$ will be computed as least squares solutions to (50) in this section.
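A minimal sketch of this least-squares route: the same block elimination as in Section 2 is applied to the $n+1$ interpolation conditions from each of the two point sets, and the resulting rectangular systems are solved with `lstsq`. The names and the default choice of `sigma` are illustrative, and the closed-form $\mathcal{O}(n)$ solutions derived in the following subsections should be preferred in practice.

```python
import numpy as np

def mpb_g_and_D(f, x0, W, h, sigma=-1.0):
    """Least-squares estimates of g and diag(D) from a minimal positive basis:
    W has n + 1 columns, so each decoupled system has n + 1 equations in n
    unknowns.  Assumes sigma != 0, 1."""
    n = x0.size
    f0 = f(x0)
    dfy = np.array([f(x0 + h * W[:, j]) - f0 for j in range(n + 1)])
    dfz = np.array([f(x0 + sigma * h * W[:, j]) - f0 for j in range(n + 1)])
    A = h * W.T                        # (n+1) x n coefficient matrix for g
    B = 0.5 * h**2 * (W * W).T         # (n+1) x n coefficient matrix for diag(D)
    g, *_ = np.linalg.lstsq(A, (sigma**2 * dfy - dfz) / (sigma**2 - sigma), rcond=None)
    d, *_ = np.linalg.lstsq(B, (sigma * dfy - dfz) / (sigma - sigma**2), rcond=None)
    return g, np.diag(d)

# Example with the coordinate minimal positive basis {e_1, e_2, -1}:
f = lambda x: np.exp(x[0]) + x[0] * x[1] + x[1]**2
W = np.hstack([np.eye(2), -np.ones((2, 1))])
g, D = mpb_g_and_D(f, np.array([0.5, 1.0]), W, h=1e-3)
print(g)            # close to the true gradient (exp(0.5) + 1, 2.5)
print(np.diag(D))   # pure second derivatives shifted by a common constant (about 0.67 here),
                    # because the extra direction mixes in off-diagonal curvature
```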
4.3 Interpolation using the coordinate minimal positive basis
Here, take the interpolation directions to be the set $\{e_1, \dots, e_n, -\mathbf{1}\}$, which corresponds to the coordinate minimal positive basis. The following results hold.
Theorem 14.
Proof.
Note that
(53) |
and
(54) |
Combining the previous two shows that the least squares solution to (50) is
∎
Corollary 15.
Corollary 15 shows that when the coordinate minimal positive basis is used in the quadratic interpolation model, the expression for $g$ is similar to the central differences formula, except that an additional shift term, which involves the function values at all the interpolation points, is added to each element.
Theorem 16.
Proof.
Corollary 17.
4.4 Interpolation using a regular minimal positive basis
Here, take the interpolation directions to be the regular minimal positive basis, whose first $n$ elements are the columns of the matrix defined in (33). Furthermore, let the function values be as in (15) with
(56) |
We are now ready to present the main results of this section.
Theorem 18.
Proof.
Corollary 19.
Let the assumptions of Theorem 18 hold. Letting and letting
gives
(58) |
or equivalently, the $i$th component of $g$ is
(59) |
Before presenting the next theorem, define
(60) |
Theorem 20.
Proof.
Note that
(62) | |||||
Applying the Sherman-Morrison-Woodbury formula (28) to (62) gives
(63) |
Furthermore,
(64) | |||||
The least squares solution to (50) is
Finally, (61) depends only upon operations involving vectors (inner and scalar products and vector additions), so the computation is $\mathcal{O}(n)$. ∎
Corollary 21.
Let the assumptions of Theorem 20 hold. Setting and letting gives
4.5 Error bounds for the interpolation model
The following is a standard error bound; the proof is included for completeness.
Theorem 22.
Let the interpolation directions form a minimal positive basis for $\mathbb{R}^n$ and let $h > 0$. Suppose the interpolation points are constructed via
(65) |
Suppose that $f$ is twice continuously differentiable and that $\nabla^2 f$ is $L$-Lipschitz continuous. Then, the least squares solution $g$ to (50) satisfies
(66) |
Proof.
We begin by noticing that
(67) | |||||
By (6) and (65), the $i$th element of (67) is
(68) | |||||
The integral form of the Mean Value Theorem provides the identities
(69) | |||||
(70) |
because the points are constructed via (65). By [30, p.14], because $f$ is twice continuously differentiable, we have, for each interpolation point,
(71) |
Combining (69), (70) and (71), and subsequently subtracting, shows that for each $i$:
Hence, taking the 2-norm of (67), using the calculation above, and recalling the standard norm inequality for vectors in $\mathbb{R}^{n+1}$, gives (66). ∎
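The bound can be checked empirically: in the sketch below the gradient error for the least-squares estimate (here with a coordinate minimal positive basis and $\sigma = -1$; the test function and point are arbitrary choices) decreases by roughly a factor of 100 each time $h$ is reduced by a factor of 10, consistent with an $\mathcal{O}(h^2)$ bound.

```python
import numpy as np

f = lambda x: np.exp(x[0]) + np.sin(3.0 * x[1]) + x[0] * x[1]**2
grad_f = lambda x: np.array([np.exp(x[0]) + x[1]**2,
                             3.0 * np.cos(3.0 * x[1]) + 2.0 * x[0] * x[1]])
x0 = np.array([0.3, -0.7])
n, sigma = x0.size, -1.0
W = np.hstack([np.eye(n), -np.ones((n, 1))])      # coordinate minimal positive basis

for h in [1e-1, 1e-2, 1e-3]:
    dfy = np.array([f(x0 + h * W[:, j]) - f(x0) for j in range(n + 1)])
    dfz = np.array([f(x0 + sigma * h * W[:, j]) - f(x0) for j in range(n + 1)])
    rhs = (sigma**2 * dfy - dfz) / (sigma**2 - sigma)     # decoupled system for g
    g, *_ = np.linalg.lstsq(h * W.T, rhs, rcond=None)
    print(h, np.linalg.norm(g - grad_f(x0)))              # error shrinks roughly like h^2
```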
4.6 Summary for quadratic interpolation models
The following tables summarize our results for diagonal quadratic interpolation models. The formulae for $g$ are expressed in terms of the quantities defined in (51). The error bounds give the accuracy of the gradient approximation in terms of the sampling radius $h$.
Basis | Error | # Samples | Theorem
---|---|---|---
CB | | |
RB | | | Thm 3
CMPB | | | Thm 14
RMPB | | | Thm 18
Basis | # Samples | Theorem
---|---|---
CB | |
RB | | Thm 6
CMPB | | Thm 16
RMPB | | Thm 20
We conclude this section by remarking that, in the special case $n = 2$, if a minimal positive basis is used to generate the points then there is enough information to determine all entries of the Hessian approximation, because $2(n+1) = 6$ function values are available while a full quadratic model (with constant term fixed at $f(x_0)$) has only $n(n+3)/2 = 5$ remaining coefficients; i.e., one can approximate the full Hessian rather than only its diagonal. The algebra is straightforward, so it is omitted for brevity.
5 Linear Interpolation Models
While quadratic interpolation models lead to accurate approximations to the gradient and diagonal of the Hessian, they require at least $2n$ function evaluations (and more if the full Hessian is desired). For some applications, this cost is too high, and a linear interpolation model is preferred. In this case, one can obtain estimates of the gradient in $\mathcal{O}(n)$ computations, and this section details how. As usual, letting $g$ denote an approximation to the gradient $\nabla f(x_0)$, a linear model is
(72) $m(x) = f(x_0) + (x - x_0)^T g.$
Estimates of the gradient can be obtained easily from a linear model, although the approximations from such a model (using $n$ or $n+1$ function evaluations, depending on the interpolation points) are only $\mathcal{O}(h)$ accurate (in the previous cases, which used roughly twice as many function evaluations, the approximations are $\mathcal{O}(h^2)$ accurate).
The interpolation conditions simply require the linear model to match the function values at the interpolation points. If there are $n$ interpolation points (formed from a unit basis for $\mathbb{R}^n$), then $g$ is the solution to
(73) |
while if there are $n+1$ interpolation points (i.e., a minimal positive basis for $\mathbb{R}^n$), then $g$ is the (least squares) solution to
(74) |
An explicit solution to (73) is known for a coordinate basis, and solutions to (74) are known for the coordinate minimal positive basis and a regular minimal positive basis.
An explicit solution to (73) for a regular basis (i.e., the matrix in (33)) is as follows. By (73), the gradient estimate solves a linear system whose coefficient matrix is the matrix in (33). Following the same arguments as in the proof of Theorem 3 then gives an explicit componentwise formula for $g$, which can be evaluated in $\mathcal{O}(n)$ operations.
The following theorem provides the standard bound for linear interpolation models.
Theorem 23.
Let the interpolation directions form a minimal positive basis for $\mathbb{R}^n$ and let $h > 0$. Suppose the interpolation points are constructed by adding $h$ times each direction to $x_0$. Suppose also that $f$ is continuously differentiable with an $L$-Lipschitz continuous gradient. Then, the least squares solution to (74) satisfies
(75) |
It is clear that if only $n$ points are used to generate the interpolation model, then an analogous bound holds, with a modified constant.
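The following sketch computes the least-squares ‘simplex gradient’ of a linear model from $n+1$ points; it uses the coordinate minimal positive basis purely as an example, and the estimate is $\mathcal{O}(h)$ accurate, in line with Theorem 23.

```python
import numpy as np

def simplex_gradient(f, x0, W, h):
    """Least-squares gradient of the linear model built from the points
    x0 + h * w_j, where the w_j are the columns of W (n or n+1 of them)."""
    df = np.array([f(x0 + h * W[:, j]) - f(x0) for j in range(W.shape[1])])
    g, *_ = np.linalg.lstsq(h * W.T, df, rcond=None)
    return g

f = lambda x: np.sum(x**2) + x[0] * x[1]
x0 = np.array([1.0, 2.0, 3.0])
W = np.hstack([np.eye(3), -np.ones((3, 1))])   # coordinate minimal positive basis
print(simplex_gradient(f, x0, W, h=1e-4))      # close to the true gradient (4, 5, 6)
```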
The results for linear interpolation models are summarized in Table 3; in each case the gradient approximation is $\mathcal{O}(h)$ accurate in the sampling radius $h$.
6 Numerical Experiments
Here we present numerical examples to verify the results of this paper, and to provide a concrete application to show that the results have practical use. All experiments are performed using MATLAB (version 2018a).
6.1 Derivative Computations
Here we depart from our usual notation and let with components so that Rosenbrock’s function can be written as
(76) |
The gradient of (76) is
(77) |
and the Hessian of (76) is
(78) |
Henceforth, we return to our usual notation.
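For readers who wish to reproduce this style of experiment, the following sketch encodes Rosenbrock's function together with its analytic gradient and the diagonal of its Hessian, and compares them with central-difference estimates; the evaluation point and sampling radius below are illustrative and are not the test points (79) and (82) used in this paper.

```python
import numpy as np

def rosenbrock(x):
    """Rosenbrock's function f(x) = 100 (x2 - x1^2)^2 + (1 - x1)^2."""
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosenbrock_grad(x):
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0]**2)])

def rosenbrock_hess_diag(x):
    return np.array([1200.0 * x[0]**2 - 400.0 * x[1] + 2.0, 200.0])

x0, h = np.array([-1.0, 1.0]), 1e-4            # illustrative point and sampling radius
e = np.eye(2)
g_cd = np.array([(rosenbrock(x0 + h * e[:, i]) - rosenbrock(x0 - h * e[:, i])) / (2.0 * h)
                 for i in range(2)])
d_cd = np.array([(rosenbrock(x0 + h * e[:, i]) - 2.0 * rosenbrock(x0)
                  + rosenbrock(x0 - h * e[:, i])) / h**2 for i in range(2)])
print(np.linalg.norm(g_cd - rosenbrock_grad(x0)))       # O(h^2) gradient error
print(np.linalg.norm(d_cd - rosenbrock_hess_diag(x0)))  # small error in the pure second derivatives
```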
For the first numerical experiment we simply use the results presented in this work to compute approximations to the gradient and diagonal of the Hessian at two different points, using Rosenbrock’s function. The points are chosen to be those where an accurate gradient is needed to make algorithmic progress on Rosenbrock’s function.
Point near valley floor
The first point is
(79) |
which was chosen because it is close to the “valley floor” for Rosenbrock’s function where a good approximation to the gradient is required to make algorithmic progress.
The aligned regular simplex is constructed using the approach previously presented. Here, so that and by (31). Recalling that (33) gives (to 4d.p.)
(80) |
For this experiment was used. For the regular basis (RB), the points were generated as , for , while the regular minimal positive basis (RMPB) used , for . For the coordinate basis (CB), the points were generated as , for , while the coordinate minimal positive basis used the previous points as well as and .
Substituting (79) into (77) and (78) shows that the analytic gradient and Hessian are
(81) |
Approximations to the gradient and Hessian were built using the theorems presented earlier in this work, and are stated in Table 4. Also stated in the table is
That is, the reported gradient error is the computed error between the true gradient reported in (81) and the computed approximation. Similarly, the reported Hessian error is the computed error between the true Hessian diagonal reported in (81) and the computed approximation. Note that, for the given point (79), the theoretical error bound for the gradient approximations is proportional to $L h^2$, with the Lipschitz constant $L$ estimated numerically.
CB | ||||
---|---|---|---|---|
RB | ||||
CMPB | ||||
RMPB |
Table 4 shows that the computed gradient approximations are all close to the true solution (81). The error in the gradient for each approximation is below the theoretical bound, as expected. For a coordinate basis and a regular minimal positive basis, the approximation to the diagonal of the Hessian is also good, both having small errors. However, for a regular basis and a coordinate minimal positive basis, the approximations are poor, both having large errors. Upon further inspection, one notices that the Hessian approximation for a regular basis is essentially just a shifted version of the approximation for a regular minimal positive basis (subtracting a fixed amount from each component of the regular basis approximation gives the approximation for a regular minimal positive basis). A similar comment holds for the approximations for a coordinate basis and for a coordinate minimal positive basis. This is consistent with the theorems presented in this work. For example, comparing Theorems 6 and 20, both estimates consist of a common vector with each component shifted by a fixed amount, the fixed amount differing between the two theorems.
Point near the solution
Here, the previous experiment is repeated at a point near the solution $(1, 1)^T$ of Rosenbrock's function. Let
(82) |
When close to the solution, it is appropriate to choose the sampling radius $h$ to be small, so a smaller value of $h$ was used here. Substituting (82) into (77) and (78) gives
(83) |
For this experiment with the point (82), the theoretical error bound for the gradient approximations is again proportional to $L h^2$, with the Lipschitz constant $L$ estimated numerically.
CB | ||||
---|---|---|---|---|
RB | ||||
CMPB | ||||
RMPB |
Table 5 again shows that the computed gradient approximations are all close to the true solution (83), with the errors in the gradient approximations all below the theoretical bound. For a coordinate basis and a regular minimal positive basis, the approximation to the diagonal of the Hessian is also good, both having small errors. However, for a regular basis and a coordinate minimal positive basis, the approximations are again poor, both having large errors. Again, the solution for a coordinate minimal positive basis is a shifted version of the approximation for a coordinate basis, and similarly for the regular basis and regular minimal positive basis.
6.2 Derivative estimates within a frame based preconditioned conjugate gradients algorithm
The purpose of this section is to give an example of how the gradient estimates presented in this work could be used within an existing derivative free optimization algorithm. The algorithm employed here is the frame based preconditioned conjugate gradients method from [12] (henceforth referred to as FB-PCG), which can be applied to problems of the form (1) where $f$ is a smooth function whose derivatives are unavailable. The algorithm uses a Conjugate Gradients (CG) type inner loop, so approximations to the gradient are needed. It is well known that the practical behaviour of CG often improves when an appropriate preconditioner is used. Thus, the inexpensive estimates of the gradient and of the diagonal of the Hessian are of particular use here; the diagonal Hessian estimate developed in this work is an inexpensive approximation to the Hessian and, being diagonal, is a convenient preconditioner. The algorithm in [12] employs frames as a globalisation strategy to ensure convergence.
Here, the algorithm is implemented exactly the same way as in [12]. The only difference is the way in which the gradient and pure second derivatives are computed. In [12], these quantities are estimated using finite differences. Here we set $\sigma = -1$, and run the algorithm using the four derivative variants presented previously in this work: (i) a Coordinate Basis (CB), which is equivalent to finite difference approximations because $\sigma = -1$, (ii) a Regular Basis (RB), (iii) a Coordinate Minimal Positive Basis (CMPB), and (iv) a Regular Minimal Positive Basis (RMPB). It is expected that the practical behaviour of the algorithm will be similar in each instance, because all gradient approximations used are $\mathcal{O}(h^2)$ accurate. However, we comment that for a coordinate minimal positive basis, the approximation to the pure second derivatives was computed using finite differences, to ensure the best possible algorithmic performance.
(84) |
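To illustrate the role played by the two estimates without reproducing FB-PCG itself, here is a heavily simplified derivative-free loop: the gradient estimate drives preconditioned nonlinear conjugate-gradient directions, and the diagonal Hessian estimate serves as the preconditioner. This is a sketch only; it omits the frames, resets, step-size reduction rules, and convergence safeguards of the algorithm in [12], and all parameter choices are arbitrary.

```python
import numpy as np

def fd_grad_and_diag(f, x, h):
    """Central-difference gradient and diagonal Hessian estimates (the CB variant)."""
    n = x.size
    f0 = f(x)
    g, d = np.empty(n), np.empty(n)
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        fp, fm = f(x + h * e), f(x - h * e)
        g[i] = (fp - fm) / (2.0 * h)
        d[i] = (fp - 2.0 * f0 + fm) / h**2
    return g, d

def preconditioned_cg_sketch(f, x, h=1e-4, iters=50):
    """NOT the FB-PCG algorithm of [12]: a bare-bones loop showing how the gradient
    estimate g and the safeguarded diagonal |d| can supply the search direction
    and the preconditioner."""
    g, d = fd_grad_and_diag(f, x, h)
    M = np.maximum(np.abs(d), 1e-8)          # safeguarded diagonal preconditioner
    p = -g / M
    for _ in range(iters):
        t, f0 = 1.0, f(x)                    # simple backtracking (Armijo) line search
        while f(x + t * p) > f0 + 1e-4 * t * np.dot(g, p) and t > 1e-12:
            t *= 0.5
        x = x + t * p
        g_new, d = fd_grad_and_diag(f, x, h)
        M = np.maximum(np.abs(d), 1e-8)
        beta = max(0.0, np.dot(g_new / M, g_new - g) / max(np.dot(g / M, g), 1e-16))
        p = -g_new / M + beta * p            # preconditioned PR+-style CG direction
        g = g_new
    return x

rosen = lambda x: 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2
print(preconditioned_cg_sketch(rosen, np.array([-1.2, 1.0])))   # Rosenbrock's minimizer is (1, 1)
```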
The algorithm is guaranteed to converge, as confirmed by the following theorem.
Theorem 24 (Theorem 1 in [12]).
If the following conditions are satisfied: (1) the sequence of iterates is bounded; (2) $f$ is continuously differentiable with a locally Lipschitz gradient; and (3) the frame size parameter tends to zero as the iteration count tends to infinity, then the subsequence of quasi-minimal frame centers is infinite, and all cluster points of this subsequence are stationary points of $f$.
Remark 25.
The algorithm of [23] is based upon the FB-PCG algorithm in [12], but there are several differences. In [23], a single minimal positive basis is employed at each iteration (i.e., a linear model is used, and approximations to the second derivatives are unavailable). Also, for the algorithm in [23], at each iteration, the new point is always set to be the point with the best (lowest) function value encountered. On the other hand, Algorithm 1 from [12] performs a fixed number of PCG iterations, to maintain the positive properties of the PCG algorithm. After this fixed number of iterations there is a reset, where the PCG process begins again from the point with the lowest function value over the preceding iterations. Because [23] employs only approximations to the gradient and does not use approximations to the second derivatives, we do not consider it here.
The problems tested are taken from the MGH test set in [27] and were chosen to allow comparison with the results in [12] and [23]. The problems are stated in Table 6, and the results of applying the FB-PCG algorithms to the test problems are given in Tables 7 and 8. As suggested in Section 2.2 of [26], each algorithm was terminated if it reached the maximum budget of 1300 function evaluations.
No. | Function Name | $n$ | No. | Function Name | $n$
---|---|---|---|---|---
1 | Rosenbrock | 2 | 12 | Box 3-D | 3 |
2 | Freudenstein and Roth | 2 | 13 | Powell singular | 4 |
3 | Powell badly scaled | 2 | 14 | Woods function | 4 |
4 | Brown badly scaled | 2 | 15 | Kowalski and Osborne | 4 |
5 | Beale | 2 | 16 | Brown and Dennis | 4 |
6 | Jennrich and Sampson | 2 | 17 | Osborne 1 | 5 |
7 | Helical valley function | 3 | 18 | Bigg’s exponential | 6 |
8 | Bard | 3 | 19 | Osborne 2 | 11 |
9 | Gaussian function | 3 | 20 | Trigonometric | 2 |
10 | Meyer function | 3 | 21 | Brown almost linear | 2 |
11 | Gulf research & development | 3 | 22 | Broyden tridiagonal | 20 |
Prob. | grad type | nf | fmin | ng | h | itns | qmfs
---|---|---|---|---|---|---|---|
1 | CB | 300 | 5.2336e-11 | 7.6674e-06 | 5.6843e-07 | 34 | 15 |
RB | 1300 | 5.3137e+00 | 4.7256e+00 | 1.5625e-02 | 149 | 3 | |
CMPB | 345 | 2.8696e-13 | 1.3357e-06 | 3.5527e-06 | 31 | 15 | |
RMPB | 381 | 5.5919e-11 | 8.3890e-06 | 2.2737e-06 | 34 | 14 | |
2 | CB | 117 | 4.8984e+01 | 8.9198e-05 | 9.5367e-06 | 14 | 9 |
RB | 102 | 4.8984e+01 | 2.6257e-06 | 3.8147e-06 | 12 | 9 | |
CMPB | 245 | 5.9689e-18 | 9.3936e-08 | 5.9605e-07 | 23 | 11 | |
RMPB | 165 | 4.8984e+01 | 8.6418e-06 | 5.9605e-06 | 16 | 10 | |
3 | CB | 1300 | 1.0178e-08 | 4.8733e+00 | 6.2500e-10 | 86 | 44 |
RB | 1300 | 1.1101e-08 | 2.2184e+00 | 1.5558e-09 | 102 | 49 | |
CMPB | 208 | 1.0101e-08 | 2.0101e-04 | 1.0000e-10 | 18 | 18 | |
RMPB | 1300 | 1.5221e-07 | 1.7535e+01 | 1.7937e-08 | 71 | 36 | |
4 | CB | 188 | 2.8703e-24 | 3.3884e-06 | 9.0949e-08 | 20 | 15 |
RB | 191 | 4.1550e-26 | 4.0788e-07 | 1.0000e-10 | 21 | 19 | |
CMPB | 1300 | 1.2157e+09 | 6.9664e+10 | 9.7656e+00 | 119 | 1 | |
RMPB | 274 | 7.5858e-24 | 5.5090e-06 | 2.2737e-10 | 21 | 18 | |
5 | CB | 96 | 1.7741e-12 | 9.5287e-06 | 9.5367e-06 | 12 | 9 |
RB | 131 | 7.6731e-14 | 7.3698e-07 | 1.4901e-06 | 16 | 11 | |
CMPB | 137 | 1.9443e-13 | 7.2973e-07 | 9.5367e-07 | 14 | 10 | |
RMPB | 135 | 1.3844e-12 | 1.0291e-06 | 9.5367e-06 | 13 | 9 | |
6 | CB | 214 | 1.2436e+02 | 1.1450e-06 | 9.3132e-08 | 26 | 13 |
RB | 179 | 1.2436e+02 | 6.9703e-04 | 2.3842e-06 | 21 | 10 | |
CMPB | 242 | 1.2436e+02 | 3.9619e-06 | 1.4901e-07 | 23 | 12 | |
RMPB | 198 | 1.2436e+02 | 4.9373e-05 | 5.9605e-08 | 20 | 12 | |
7 | CB | 277 | 2.4478e-16 | 4.0593e-08 | 2.2737e-08 | 27 | 16 |
RB | 358 | 1.3523e-12 | 5.2517e-06 | 3.6380e-06 | 33 | 13 | |
CMPB | 386 | 1.1919e-17 | 1.2976e-07 | 2.2204e-08 | 31 | 18 | |
RMPB | 404 | 1.3124e-13 | 6.2378e-07 | 5.6843e-08 | 32 | 16 | |
8 | CB | 228 | 8.2149e-03 | 1.5169e-06 | 3.6380e-08 | 21 | 15 |
RB | 226 | 8.2149e-03 | 1.2996e-06 | 1.4552e-07 | 21 | 14 | |
CMPB | 332 | 8.2149e-03 | 3.2610e-06 | 9.3132e-08 | 27 | 13 | |
RMPB | 325 | 8.2149e-03 | 5.2576e-06 | 5.6843e-08 | 25 | 16 | |
9 | CB | 88 | 1.1279e-08 | 3.0817e-08 | 3.8147e-06 | 9 | 9 |
RB | 88 | 1.1279e-08 | 2.6819e-08 | 3.8147e-06 | 9 | 9 | |
CMPB | 106 | 1.1279e-08 | 2.5277e-08 | 3.8147e-06 | 9 | 9 | |
RMPB | 106 | 1.1279e-08 | 2.2890e-07 | 3.8147e-06 | 9 | 9 | |
10 | CB | 1300 | 2.6761e+04 | 1.4694e+06 | 1.7937e-03 | 95 | 31 |
RB | 1300 | 4.7088e+06 | 1.4611e+08 | 8.6736e-05 | 114 | 16 | |
CMPB | 1300 | 8.7574e+06 | 3.3756e+08 | 5.5511e-02 | 99 | 12 | |
RMPB | 1300 | 6.6179e+06 | 2.0357e+08 | 8.6736e-04 | 95 | 15 | |
11 | CB | 611 | 2.1173e-12 | 5.3498e-06 | 1.2622e-08 | 50 | 27 |
RB | 586 | 8.6109e-10 | 9.3682e-06 | 8.0779e-09 | 48 | 26 | |
CMPB | 653 | 5.8997e-10 | 7.3782e-06 | 8.2718e-09 | 46 | 24 | |
RMPB | 1300 | 3.7178e-04 | 1.1058e-02 | 7.1746e-05 | 97 | 32 |
Prob. | op | nf | fmin | ng | h | itn | qmf |
---|---|---|---|---|---|---|---|
12 | CB | 258 | 1.0126e-06 | 8.9612e-06 | 5.6843e-10 | 22 | 18 |
RB | 132 | 7.8257e-17 | 1.1635e-08 | 5.9605e-06 | 12 | 10 | |
CMPB | 336 | 3.7634e-08 | 5.5972e-06 | 3.5527e-09 | 24 | 18 | |
RMPB | 170 | 8.7957e-13 | 9.6558e-08 | 1.4901e-06 | 13 | 11 | |
13 | CB | 388 | 9.5087e-09 | 7.8340e-06 | 8.8818e-10 | 29 | 19 |
RB | 680 | 1.2044e-11 | 3.4770e-06 | 8.2718e-10 | 53 | 25 | |
CMPB | 539 | 9.7590e-10 | 2.7194e-06 | 1.3878e-09 | 36 | 20 | |
RMPB | 732 | 1.2329e-09 | 2.2736e-06 | 5.2940e-09 | 49 | 23 | |
14 | CB | 496 | 2.2343e-13 | 3.8564e-06 | 1.3878e-08 | 39 | 19 |
RB | 664 | 1.0394e-12 | 9.6018e-06 | 1.3553e-08 | 52 | 21 | |
CMPB | 1175 | 7.0443e-21 | 3.2592e-09 | 3.1554e-10 | 79 | 29 | |
RMPB | 1200 | 2.6560e-16 | 4.1629e-07 | 3.0815e-09 | 80 | 30 | |
15 | CB | 409 | 3.0751e-04 | 7.0794e-08 | 3.4694e-10 | 29 | 21 |
RB | 730 | 3.0751e-04 | 8.1424e-07 | 3.0815e-09 | 54 | 30 | |
CMPB | 456 | 3.0751e-04 | 2.1624e-07 | 1.0000e-10 | 29 | 22 | |
RMPB | 879 | 3.0751e-04 | 2.9915e-06 | 1.9259e-10 | 56 | 32 | |
16 | CB | 218 | 8.5822e+04 | 2.0390e-01 | 2.3842e-05 | 18 | 9 |
RB | 256 | 8.5822e+04 | 5.2632e-02 | 5.9605e-06 | 21 | 10 | |
CMPB | 328 | 8.5822e+04 | 8.3039e-03 | 9.5367e-06 | 23 | 9 | |
RMPB | 269 | 8.5822e+04 | 1.3964e-02 | 5.9605e-06 | 19 | 10 | |
17 | CB | 1300 | 6.1863e-05 | 7.8087e-02 | 7.5232e-06 | 86 | 29 |
RB | 1300 | 9.2307e-02 | 6.5300e+01 | 1.5259e-03 | 86 | 6 | |
CMPB | 1300 | 3.3394e-03 | 6.3918e+00 | 9.5367e-04 | 77 | 7 | |
RMPB | 1300 | 5.3120e-02 | 1.9312e+01 | 9.5367e-04 | 78 | 7 | |
18 | CB | 522 | 5.6556e-03 | 5.2532e-06 | 8.6736e-09 | 30 | 20 |
RB | 1300 | 5.6557e-03 | 1.8064e-04 | 1.1210e-08 | 73 | 37 | |
CMPB | 712 | 5.6557e-03 | 3.1441e-06 | 2.1176e-09 | 36 | 23 | |
RMPB | 1300 | 5.6559e-03 | 8.2987e-04 | 7.3468e-08 | 65 | 33 | |
19 | CB | 1300 | 4.0437e-02 | 4.3713e-02 | 1.3235e-03 | 48 | 18 |
RB | 1300 | 2.8701e-01 | 2.5030e+00 | 6.2500e-02 | 49 | 2 | |
CMPB | 1300 | 4.6094e-02 | 2.3523e-01 | 5.2940e-03 | 45 | 17 | |
RMPB | 1300 | 2.8656e-01 | 3.4951e+00 | 6.2500e-02 | 46 | 2 | |
20 | CB | 80 | 9.2331e-18 | 1.3956e-08 | 3.8147e-06 | 10 | 9 |
RB | 78 | 1.0496e-17 | 1.4880e-08 | 3.8147e-06 | 10 | 9 | |
CMPB | 97 | 1.9978e-17 | 2.0529e-08 | 3.8147e-06 | 10 | 9 | |
RMPB | 99 | 8.0511e-18 | 1.3032e-08 | 3.8147e-06 | 10 | 9 | |
21 | CB | 77 | 2.3695e-19 | 1.8024e-09 | 3.8147e-06 | 10 | 9 |
RB | 77 | 1.5376e-16 | 6.2567e-08 | 3.8147e-06 | 10 | 9 | |
CMPB | 98 | 6.8671e-14 | 1.3603e-06 | 3.8147e-06 | 10 | 9 | |
RMPB | 96 | 7.0015e-15 | 4.3732e-07 | 3.8147e-06 | 10 | 9 | |
22 | CB | 1155 | 7.4976e-13 | 7.8573e-06 | 1.0000e-10 | 26 | 19 |
RB | 1244 | 7.6929e-13 | 8.5455e-06 | 1.0000e-10 | 28 | 18 | |
CMPB | 1207 | 2.9752e-13 | 5.3954e-06 | 1.0000e-10 | 26 | 19 | |
RMPB | 1113 | 4.3271e-13 | 8.0808e-06 | 1.0000e-10 | 24 | 18 |
The experiments show that the Frame Based Preconditioned Conjugate Gradients algorithm performs similarly regardless of which gradient and diagonal Hessian approximation is used, as expected.
7 Conclusion
This work studied the use of interpolation models to obtain approximations to the gradient and diagonal of the Hessian of a function $f$. We showed that if the coordinate basis is used in the interpolation model, with the parameter $\sigma = -1$ used to generate the interpolation points, then one recovers the standard finite difference approximations to the derivatives. We also showed that if a regular basis, a coordinate minimal positive basis, or a regular minimal positive basis is used within the interpolation model, then approximations to the gradient and to the pure second derivatives are available in $\mathcal{O}(n)$ flops. An error bound was presented which shows that, for each set of interpolation directions considered in this work, the gradient approximation is $\mathcal{O}(h^2)$ accurate. Numerical experiments were presented that show the practical uses of the estimates developed. In particular, we showed how they could be employed within the FB-PCG algorithm to solve (1).
References
- [1] Charles Audet and Warren Hare. Derivative-free and blackbox optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2017.
- [2] A. S. Bandeira, K. Scheinberg, and L. N. Vicente. Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization. Mathematical Programming Series B, 134:223–257, 2012.
- [3] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018. Editor: Léon Bottou.
- [4] A. S. Berahas, L. Cao, K. Choromanskiy, and K. Scheinberg. A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Technical report, Department of Industrial and Systems Engineering, Lehigh University,, Bethlehem, PA, USA, 2019. arXiv:1905.01332v2 [math.OC].
- [5] G. Cocchi, G. Liuzzi, A. Papini, and M. Sciandrone. An implicit filtering algorithm for derivative-free multiobjective optimization with box constraints. Computational Optimization and Applications, 69(2):267–296, 2018.
- [6] A. R. Conn, K. Scheinberg, and Luís N. Vicente. Geometry of interpolation sets in derivative free optimization. Mathematical Programming Series B, 111:141–172, 2008.
- [7] A. R. Conn, K. Scheinberg, and Luís N. Vicente. Geometry of sample sets in derivative-free optimization: polynomial regression and underdetermined interpolation. IMA Journal of Numerical Analysis, 28:721–748, 2008.
- [8] Andrew R. Conn and Philippe L. Toint. An algorithm using quadratic interpolation for unconstrained derivative free optimization. In G. Di Pillo and F. Giannessi, editors, Nonlinear Optimization and Applications, pages 27–47, Boston, MA, 1996. Springer US.
- [9] A.R. Conn, K.Scheinberg, and L.N. Vicente. Introduction to Derivative-Free Optimization. MPS-SIAM Series on Optimization, Philadelphia, 2009.
- [10] A.R. Conn, K.Scheinberg, and L.N. Vicente. Introduction to derivative-free optimization. MPS-SIAM Series on Optimization, Philadelphia, 2009.
- [11] A.R. Conn, K. Scheinberg, and P.L. Toint. Recent progress in unconstrained nonlinear optimization without derivatives. Mathematical Programming, 79:397–414, 1997.
- [12] I. D. Coope and C. J. Price. A derivative-free frame-based conjugate gradients method. Ucdms2002-7, University of Canterbury, 2002. https://ir.canterbury.ac.nz/handle/10092/11758.
- [13] I. D. Coope and R. Tappenden. Efficient calculation of regular simplex gradients. Computational Optimization and Applications, 72(3):561–588, April 2019.
- [14] I.D. Coope and C.J. Price. Frame-based methods for unconstrained optimization. Journal of Optimization Theory & Applications, 107(2):261–274, 2000.
- [15] I.D. Coope and C.J. Price. Positive bases in numerical optimization. Computational Optimization & Applications, 21:169–175, 2002.
- [16] Giovanni Fasano, José Luis Morales, and Jorge Nocedal. On the geometry phase in model-based algorithms for derivative-free optimization. Optimization Methods and Software, 24(1):145–154, 2009.
- [17] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. Proceedings of Machine Learning Research (PMLR), 80:1467–1476, 2018. International Conference on Machine Learning, 10-15 July 2018, Stockholmsmässan, Stockholm, Sweden.
- [18] P. Gilmore, C. T. Kelley, C. T. Miller, and G. A. Williams. Implicit filtering and optimal design problems. In Jeffrey Borggaard, John Burkardt, Max Gunzburger, and Janet Peterson, editors, Optimal Design and Control, pages 159–176, Boston, MA, 1995. Birkhäuser Boston.
- [19] P. Gilmore and C.T. Kelley. An implicit filtering algorithm for optimization of functions with many local minima. SIAM J. Optim., 5(2):269–285, 1995.
- [20] W. Hare and M. Jaberipour. Adaptive interpolation strategies in derivative-free optimization: a case study. Technical report, University of British Colombia, Canada, and Amirkabir University of Technology, Iran, November 2015. arXiv:1511.02794v1 [math.OC].
- [21] Philipp H. W. Hoffmann. A hitchhiker’s guide to automatic differentiation. Numerical Algorithms, 72(3):775–811, 2016.
- [22] Gabriel Jarry-Bolduc, Patrick Nadeau, and Shambhavi Singh. Uniform simplex of an arbitrary orientation. Optimization Letters, 2019. Published online 03 July 2019, https://doi.org/10.1007/s11590-019-01448-3.
- [23] Q. Liu. Two minimal positive basis based direct search conjugate gradient methods for computationally expensive functions. Numer Algor, 58:461–474, 2011.
- [24] Alvaro Maggiar, Andreas Wächter, Irina S Dolinskaya, and Jeremy Staum. A derivative-free trust-region algorithm for the optimization of functions smoothed via gaussian convolution using adaptive multiple importance sampling. SIAM Journal on Optimization, 28(2):1478–1507, 2018.
- [25] C. C. Margossian. A review of automatic differentiation and its efficient implementation. Technical report, Department of Statistics, Columbia University, January 2019. arXiv:1811.05031v2 [cs.MS].
- [26] J. J. Moré and S. Wild. Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization, 20(1):172–191, 2009.
- [27] J.J. More, B.S. Garbow, and K.E. Hillstrom. Testing unconstrained optimization software. ACM Trans. Math. Softw., 7(1):17–41, 1981.
- [28] J.A. Nelder and R. Mead. A simplex method for function minimization. The computer journal, 7(4):308–313, 1965.
- [29] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
- [30] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 2 edition, 2006.
- [31] Luis Miguel Rios and Nikolaos V. Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations. J Glob Optim, 56:1247–1293, 2013.
- [32] W. Spendley, G.R. Hext, and F.R. Himsworth. Sequential application of simplex designs in optimisation and evolutionary operation. Technometrics, 4:441–461, 1962.
- [33] Stefan M. Wild and Christine Shoemaker. Global convergence of radial basis function trust-region algorithms for derivative-free optimization. SIAM Review, 55(2):349–371, 2013.
Appendix A Constants
In this section we state several equivalences between constants that are used in this work.
(85) |
(86) |
(87) | |||||
(88) |
(89) |
(90) |
(91) |
(92) |
(93) |
(94) |