Gradient and Hessian approximations in Derivative Free Optimization
Abstract
This work investigates finite differences and the use of interpolation models to obtain approximations to the first and second derivatives of a function. Here, it is shown that if a particular set of points is used in the interpolation model, then the solution to the associated linear system (i.e., approximations to the gradient and diagonal of the Hessian) can be obtained in $\mathcal{O}(n)$ computations, which is the same cost as finite differences, and is a saving over the $\mathcal{O}(n^3)$ cost of solving a general unstructured linear system. Moreover, if the interpolation points are formed using a ‘regular minimal positive basis’, then the error bound for the gradient approximation is the same as for a finite differences approximation. Numerical experiments are presented that show how the derivative estimates can be employed within an existing derivative free optimization algorithm, thus demonstrating one of the potential practical uses of these derivative approximations.
Keywords. Derivative Free Optimization; Positive Bases; Finite difference approximations; Interpolation models; Taylor series; Conjugate Gradients; Simplices; Simplex Gradients; Preconditioning; Frames.
1 Introduction
This work considers unconstrained optimization problems of the form
(1) $\min_{x \in \mathbb{R}^n} f(x),$
where $f : \mathbb{R}^n \to \mathbb{R}$ is smooth, but the derivatives of $f$ are unavailable. This is a typical derivative free optimization setting. Problems of this form arise, for example, when the function is smooth but it is computationally impractical to obtain derivative information, or when the function itself is not known explicitly although function values are available from a black-box or oracle (for example, via physical measurements).
A key ingredient of many algorithms for solving (1) in a derivative free setting is an estimate of the gradient of $f$. Sometimes gradients are used to determine search directions (for example, in implicit filtering algorithms [5, 18, 19]), and sometimes they are used in the definition of an algorithm’s stopping criteria. Several techniques exist for estimating derivatives, including automatic differentiation [21, 3, 25], finite difference approximations, interpolation models [8, 11], Gaussian smoothing [24, 29], and smoothing on the unit sphere [17]. This work investigates finite differences and interpolation models, so henceforth, we focus on these approaches.
A well known, simple, and widely applicable approach to estimating derivatives is finite difference approximations (or finite differences). Finite differences are so called because they provide estimates to derivatives by taking the difference between function values at certain known points. For example, for a function $f : \mathbb{R} \to \mathbb{R}$, the forward differences approximation to the (first) derivative is
(2) $f'(x) \approx \dfrac{f(x+h) - f(x)}{h},$
which is an $\mathcal{O}(h)$ accurate approximation. Many other (well known) formulations exist, including approximations that provide higher-accuracy estimates of the gradient, as well as formulations for higher order derivatives. The standard finite difference formulations can be used whenever function values evaluated along coordinate directions are available. The formulations involve minimal computations ($\mathcal{O}(n)$ operations), and they have low memory requirements because one does not need to store the coordinate directions, but finite differences can suffer from cancellation errors.
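To make this concrete, the following short Python/numpy sketch applies the forward-difference idea componentwise along the coordinate directions to estimate a gradient; the function name, test function, and step size are illustrative choices and are not taken from this paper.

```python
import numpy as np

def forward_difference_gradient(f, x, h=1e-6):
    """O(h)-accurate gradient estimate using n extra function evaluations,
    one along each coordinate direction e_1, ..., e_n."""
    n = x.size
    f0 = f(x)
    g = np.empty(n)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        g[i] = (f(x + h * e_i) - f0) / h
    return g

# Example: f(x) = x_1^2 + 3 x_2 at (1, 2); the true gradient is (2, 3).
f = lambda x: x[0]**2 + 3.0 * x[1]
print(forward_difference_gradient(f, np.array([1.0, 2.0])))
```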
Another approach to approximating derivatives is to use interpolation models [9, 33, 20]. Interpolation models can be viewed as a generalization of finite differences, where, instead of being restricted to the coordinate directions, function evaluations along a general set of directions are utilized, and derivative approximations are found by solving an associated system of equations (i.e., minimizing the interpolation model). Linear models lead to approximations to the gradient [4], while quadratic models provide approximations to the gradient and Hessian [8]. Much research has focused on the ‘quality’ of the set of directions, because this impacts the accuracy of the derivative estimates [2, 6, 7, 16].
The purpose of this work is to show that, if one uses a (diagonal) quadratic interpolation model with carefully chosen interpolation directions and function values, then approximations to the gradient and diagonal of the Hessian can be obtained in $\mathcal{O}(n)$ computations and with only a few vectors of storage. The solution to this specific instance of an interpolation model can be thought of, in some sense, as a general finite difference type approximation to the gradient and diagonal of the Hessian, because the solution to the interpolation model is a formula that involves sums/differences of function values at the interpolation points. Although the focus of this work is on derivative estimates and not algorithmic development, the estimates here are widely applicable, and could be employed by many algorithms that require gradient approximations. Moreover, the diagonal Hessian approximation is readily available, and could be used as a preconditioner within an algorithm. (See the numerical experiments in Section 6, which include a derivative free preconditioned conjugate gradients algorithm.)
It remains to note that the approaches mentioned above all require access to function values, and the cost of evaluating a function in a derivative free optimization setting can vary greatly depending upon the particular application being considered. It is up to the user to determine their function evaluation budget; the approaches in this work require $\mathcal{O}(n)$ function evaluations to generate derivative approximations.
1.1 Summary of Contributions
The contributions of this work are listed now.
1. Derivative estimates. This work builds a quadratic interpolation model (with a diagonal approximation to the Hessian), and shows that if a certain set of interpolation directions is used in the quadratic model, with associated function evaluations, then approximations to the gradient and diagonal of the Hessian can be found in $\mathcal{O}(n)$ operations. The set of interpolation directions is referred to as a regular minimal positive basis, and these directions need not be stored in full, but can be computed on-the-fly from a small number of stored vectors.
2. An error bound. We confirm that the gradient approximations developed in this work are $\mathcal{O}(h^2)$ accurate, where $h$ is the ‘sampling radius’. (See Section 4.5.)
3. Application. We provide an example of how the gradient estimates developed in this work could be used in practice. In particular, we take the frame based, preconditioned conjugate gradients algorithm developed in [12], which requires estimates of the gradient and a diagonal preconditioner, and employ the gradient and diagonal Hessian estimates developed in this work. (The original paper [12] employed the standard central difference approximations to the gradient and diagonal of the Hessian.)
1.2 Notation
Let $\mathbf{1}$ denote the vector of all ones and let $e_i$ for $i = 1, \dots, n$ denote the standard unit basis vectors in $\mathbb{R}^n$, i.e., $e_i$ is the $i$th column of the identity matrix $I$. (Usually the dimension will be $n$ or $n+1$ in this work.)
Consider the points . Define for , i.e., denotes the function value at the point . It is convenient to define the following vectors, which contain the function values:
(3) |
Similarly, for the points , define for all , (i.e., denotes the function value at the point ), and
(4) |
It will also be convenient to define the vectors
(5) |
and
(6) |
Usually, vectors are assumed to lie in $\mathbb{R}^n$, while the subscript ‘$+$’ denotes that the vector has an additional entry, so such a vector is an element of $\mathbb{R}^{n+1}$ (recall the notation above). Matrices are assumed to be elements of $\mathbb{R}^{n \times n}$, while the subscript ‘$+$’ denotes an additional column, so that such a matrix is an element of $\mathbb{R}^{n \times (n+1)}$.
The symbol ‘$\odot$’ denotes the Hadamard (componentwise) product, i.e., given $a, b \in \mathbb{R}^n$, $a \odot b \in \mathbb{R}^n$ is the vector with components $(a \odot b)_i = a_i b_i$.
1.3 Paper Outline
This paper is organised as follows. Section 2 introduces the (diagonal) quadratic interpolation model employed in this work, and shows how derivative approximations can be found using the model. Section 3 describes how approximations to the gradient and the diagonal of the Hessian can be constructed from the model in $\mathcal{O}(n)$ computations when the interpolation directions are chosen to be the coordinate basis, and also when they are chosen to be a ‘regular basis’ (see Section 3.2). Many algorithms in derivative free optimization are based upon simplices, which can be generated from a minimal positive basis. Thus, in Section 4, approximations to the gradient and the diagonal of the Hessian that can be computed in $\mathcal{O}(n)$ operations are presented when the interpolation points are formed from either a coordinate minimal positive basis, or a regular minimal positive basis (see Section 4.2). It will also be shown that the regular basis and the regular minimal positive basis are closely related to a regular simplex, hence the terminology ‘regular’ (see Section 4.1). In Section 4.5 an error bound is presented to confirm that the gradient approximations developed in this work are $\mathcal{O}(h^2)$ accurate, where $h$ is the sampling radius. Section 5 summarizes existing work on linear models, while numerical experiments are presented in Section 6 to demonstrate the accuracy and applicability of these results.
2 Derivative Estimation via Interpolation Models
This section describes how to approximate derivatives of a function via interpolation models. The Taylor series for a function $f$ about a point $x_0$ is
(7) $f(x) \approx f(x_0) + (x - x_0)^T \nabla f(x_0) + \tfrac{1}{2} (x - x_0)^T \nabla^2 f(x_0) (x - x_0).$
Let $g$ denote an approximation to the gradient (i.e., $g \approx \nabla f(x_0)$), and let $H$ denote an approximation to the Hessian (i.e., $H \approx \nabla^2 f(x_0)$). Then (7) becomes
(8) $f(x) \approx f(x_0) + (x - x_0)^T g + \tfrac{1}{2} (x - x_0)^T H (x - x_0).$
In this work, only a diagonal approximation to the Hessian is considered, so define
(9) $D = \operatorname{diag}(d_1, \dots, d_n),$
where $d = (d_1, \dots, d_n)^T \in \mathbb{R}^n$. Combining (8) and (9) leads to the (diagonal) quadratic model
(10) $m(x) = f(x_0) + (x - x_0)^T g + \tfrac{1}{2} (x - x_0)^T D (x - x_0).$
The motivation for this diagonal quadratic model is as follows. It is well known that, if one uses a linear model with $n$ (or $n+1$) interpolation points (i.e., $n$ or $n+1$ function evaluations), then an estimate of the gradient is available that is $\mathcal{O}(h)$ accurate (see also Section 5, which summarises results for linear models). For some situations, an $\mathcal{O}(h)$ accurate gradient is not of sufficient accuracy to make algorithmic progress, so an $\mathcal{O}(h^2)$ accurate gradient is used instead. In this case, one can still use a linear model, but roughly twice as many interpolation points (i.e., function evaluations) are needed. In derivative free optimization, function evaluations are often expensive to obtain, so it is prudent to ‘squeeze out’ as much information from them as possible. Hence, suppose one had this larger number of function evaluations available because an $\mathcal{O}(h^2)$ accurate gradient was desired. The model (10) contains $2n$ unknowns, so $2n$ interpolation points and function values are enough to uniquely determine an $\mathcal{O}(h^2)$ accurate gradient $g$, as well as an approximation to the pure second derivatives on the diagonal of $D$, i.e., no additional function evaluations are required to compute the pure second derivatives. Although $D$ can be determined ‘for free’ in terms of function evaluations (assuming these function values are already available for the gradient estimate), we will also show that the diagonal matrix $D$ can be computed in $\mathcal{O}(n)$ flops, so the linear algebra costs are low. This is similar to the case for finite differences (i.e., one requires $2n$ function evaluations to compute $g$ via the central differences formula, and those same function values can also be used to compute the pure second derivatives), and so this diagonal quadratic model (10) is useful because one can obtain analogous results for a certain set of interpolation points.
The goal is to determine $g$ and $D$ from the interpolation model (10) using a set of known points and the corresponding function values. Hence, one must build a sample set (a set of interpolation points). In general, one can use any number of interpolation points, although this affects the model. For a linear model (omitting the third term in (10)), one can determine a unique solution if there are $n$ interpolation/sample points with known function values at those points (see Section 5). For a general quadratic model with a full Hessian approximation, one requires $\mathcal{O}(n^2)$ function values to determine unique solutions for the gradient and Hessian approximations. If one has access to more or fewer sample points/function values, then one can use, for example, a least squares approach to find approximations to $g$ and $D$. In the rest of this section, $2n$ interpolation points will be used, and it will become clear that this allows unique solutions $g$ and $D$ to the model (10).
Given , a set of directions , and a scalar (sometimes referred to as the sampling radius), a set of points is defined via the equations
(11) |
The model is constructed to satisfy the interpolation conditions
(12) |
Substituting into the model (10) gives the equations
(13) |
Because $D$ is diagonal, the quadratic term satisfies $u^T D u = (u \odot u)^T d$ for any $u \in \mathbb{R}^n$. Defining
(14) | |||||
(15) |
where , allows (13) to be written in matrix form as
(16) |
The system (16) is underdetermined (there are only $n$ equations in $2n$ unknowns), and as the goal is to find unique solutions $g$ and $D$, an additional $n$ equations are required. Define
(17) |
Now, keep the point $x_0$ and the set of directions fixed, and choose a scalar $\sigma$ satisfying (17). A new set of points is defined via the equations
(18) |
The definition of $\sigma$ in (17) ensures that the new points do not coincide with the points already generated. The model is constructed to satisfy the interpolation conditions
(19) |
(That is, and for .) Following similar arguments to those previously established, one arrives at the system
(20) | |||||
The systems (16) and (20) can be combined into block matrix form as follows:
(21) |
It is clear that (21) represents a system of $2n$ equations in $2n$ unknowns. Performing block row operations on the system (21) (i.e., multiplying the first block row by $\sigma^2$ and subtracting the second block row, and also multiplying the first block row by $\sigma$ and subtracting the second block row) gives
(22) |
Hence, estimates of the gradient and diagonal of the Hessian can be found by solving
(23) |
where
(24) | |||||
(25) |
Questions naturally arise, such as ‘when do unique solutions to (23) exist?’ and ‘what is the cost of computing the solutions?’. Clearly, if the coefficient matrices in (23) have full rank, then unique solutions to (23) exist. Their rank, of course, depends upon how the interpolation points are chosen. If the interpolation directions form a basis for $\mathbb{R}^n$, then a unique solution $g$ exists. Similarly, if the Hadamard squares of the interpolation directions form a basis for $\mathbb{R}^n$, then a unique solution $D$ exists.
Computing the solution to a general unstructured $n \times n$ system has a cost of $\mathcal{O}(n^3)$ flops. The focus of the remainder of this paper is to show that, if one selects the interpolation points/interpolation directions in a specific way, then unique solutions $g$ and $D$ to (23) exist, and moreover, the computational cost of finding the solutions remains low, at $\mathcal{O}(n)$. This is the same cost as finite difference approximations to the gradient and pure second order derivatives. Importantly, it will also be shown that this specific set of directions does not increase the memory footprint either, requiring only a few vectors of storage.
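As a concrete point of reference for this cost comparison, the sketch below assembles the combined interpolation system for the diagonal quadratic model (10) with an arbitrary set of directions and solves it with a generic dense solver; this is precisely the $\mathcal{O}(n^3)$ route that the structured bases studied in the rest of the paper avoid. The names (`U` for the matrix of interpolation directions, `sigma` for the scaling of the second point set) are illustrative.

```python
import numpy as np

def interpolation_g_and_D(f, x0, U, h, sigma=-1.0):
    """Build the 2n x 2n block interpolation system for the diagonal quadratic
    model and solve it with a dense solver (O(n^3) linear algebra).
    U holds the n interpolation directions as columns; the second set of
    points reuses the same directions scaled by sigma (sigma != 0, 1)."""
    n = x0.size
    f0 = f(x0)
    dfy = np.array([f(x0 + h * U[:, i]) - f0 for i in range(n)])
    dfz = np.array([f(x0 + sigma * h * U[:, i]) - f0 for i in range(n)])
    A = h * U.T                   # coefficients of g in each interpolation condition
    B = 0.5 * h**2 * (U * U).T    # coefficients of diag(D): Hadamard squares of the directions
    M = np.block([[A, B], [sigma * A, sigma**2 * B]])
    sol = np.linalg.solve(M, np.concatenate([dfy, dfz]))
    return sol[:n], np.diag(sol[n:])

# For a quadratic test function the recovery is exact (up to rounding):
f = lambda x: x[0]**2 + 2.0 * x[1]**2 + x[0] * x[1]
g, D = interpolation_g_and_D(f, np.array([1.0, -1.0]), np.eye(2), h=1e-3)
print(g, np.diag(D))    # approximately (1, -3) and (2, 4)
```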
In what follows, it will be useful to notice that, in the special case $\sigma = -1$, (24) and (25) become
(26) | |||||
(27) |
The Sherman-Morrison formula, for a nonsingular matrix $A \in \mathbb{R}^{n \times n}$ and vectors $u, v \in \mathbb{R}^n$, is used many times in this work, and is stated now for convenience:
(28) $(A + uv^T)^{-1} = A^{-1} - \dfrac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}, \qquad \text{provided } 1 + v^T A^{-1} u \neq 0.$
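A quick numerical sanity check of (28), with randomly generated data chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # nonsingular with high probability
u, v = rng.standard_normal(n), rng.standard_normal(n)
A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = A_inv - np.outer(A_inv @ u, v @ A_inv) / (1.0 + v @ A_inv @ u)
print(np.allclose(lhs, rhs))    # True
```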
3 Computing derivative estimates via interpolation
This section describes how to compute solutions to the diagonal quadratic interpolation model when the coordinate basis is used to construct the interpolation points, and also when the regular basis is employed. In both cases the computational cost of determining the solution is $\mathcal{O}(n)$, and when the coordinate basis is used with $\sigma = -1$ in (17), the usual finite differences approximations to the first and second (pure) derivatives are recovered.
3.1 Interpolation using the coordinate basis
Here we present the solutions $g$ and $D$ to (23) when the coordinate basis is used to generate the interpolation points. Given a point $x_0$, take the interpolation directions to be the coordinate vectors $e_1, \dots, e_n$, so that the interpolation points are $x_0 + h e_i$ and $x_0 + \sigma h e_i$ for $i = 1, \dots, n$. Substituting these points into (23) shows that the coefficient matrices are diagonal. Now, set $\sigma = -1$. Using (26) in the expression for $g$ shows that the $i$th element of $g$ is
(29) $(g)_i = \dfrac{f(x_0 + h e_i) - f(x_0 - h e_i)}{2h}, \qquad i = 1, \dots, n.$
Notice that (29) is the standard central differences approximation to the gradient, which is known to be $\mathcal{O}(h^2)$ accurate (see also Section 4.5).
Similarly, substituting (27) into the expression for $D$ shows that the $i$th diagonal element of $D$ is
(30) $(D)_{ii} = \dfrac{f(x_0 + h e_i) - 2 f(x_0) + f(x_0 - h e_i)}{h^2}, \qquad i = 1, \dots, n,$
which is the standard central differences approximation to the diagonal entries of the Hessian.
This confirms that when the coordinate basis is used in the diagonal quadratic interpolation model, with $\sigma = -1$, then (23) recovers the central finite differences formulas for the gradient and pure second order derivatives.
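For reference, (29) and (30) can be evaluated directly as follows; this is a plain central-difference routine using $2n + 1$ function evaluations, and the names are illustrative.

```python
import numpy as np

def central_difference_g_and_D(f, x0, h=1e-3):
    """Central-difference estimates of the gradient (29) and of the diagonal
    of the Hessian (30); O(n) arithmetic on top of 2n + 1 function values."""
    n = x0.size
    f0 = f(x0)
    g = np.empty(n)
    d = np.empty(n)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        f_plus, f_minus = f(x0 + h * e_i), f(x0 - h * e_i)
        g[i] = (f_plus - f_minus) / (2.0 * h)
        d[i] = (f_plus - 2.0 * f0 + f_minus) / h**2
    return g, np.diag(d)
```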
3.2 Interpolation using a regular basis
This section shows how gradient and diagonal Hessian estimates can be computed in $\mathcal{O}(n)$ flops if a specific basis is used to generate the interpolation points. For ease of reference, we call this a regular basis. (It will be shown that the directions forming the regular basis are the internal arms of a regular simplex; see Section 4.)
Define the scalars
(31) |
Let be the regular basis for , whose elements are defined by
(32) |
where the basis vectors can be collected as the columns of the matrix
(33) |
Equation (32) shows that the vectors are simply a multiple of $\mathbf{1}$ (the vector of all ones) with an adjustment to the $j$th component. Thus, these vectors need not be stored explicitly; they can simply be generated on-the-fly and then discarded. Therefore, the regular basis has a low storage footprint.
In [13, p.569] (see also page 20 and Corollary 2.6 in [10]) it is shown that, for the matrix defined in (33),
which shows that the elements of the regular basis have unit length:
(34) |
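The following sketch uses one standard construction with the properties described above (each direction is a multiple of the all-ones vector with one adjusted component, and every direction has unit length); the constants `a` and `b` below are an assumption and may be normalised or signed differently from the scalars in (31).

```python
import numpy as np

def regular_basis(n):
    """An assumed 'regular basis': n unit vectors of the form a*e_j + b*1 whose
    pairwise inner products all equal -1/n, so that appending -(1/sqrt(n)) * 1
    yields n + 1 unit directions with equal angles between every pair."""
    a = np.sqrt((n + 1.0) / n)
    b = -(np.sqrt(n + 1.0) - 1.0) / n**1.5
    return a * np.eye(n) + b * np.ones((n, n))

V = regular_basis(4)
print(np.linalg.norm(V, axis=0))    # every column has unit length
print(np.round(V.T @ V, 6))         # off-diagonal entries equal -1/n
```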
The following lemma confirms that the columns of the matrix in (33) form a basis for $\mathbb{R}^n$, and also establishes positive definiteness of the associated Gram matrix.
Lemma 1 (Lemma 2 in [13]).
Corollary 2.
Let the conditions of Lemma 1 hold. Then and .
Now, a set of interpolation points is generated using the regular basis. Fix $x_0$, $h$ and $\sigma$. Take the directions as in (33), and define the interpolation points via
(35) |
Note that, because the regular basis directions all have unit length by (34), the first set of interpolation points all lie on a circle of radius $h$ centred at $x_0$, and the second set all lie on a circle of radius $|\sigma| h$. Figure 1 provides an illustration showing the coordinate basis and the regular basis in $\mathbb{R}^2$, and how they can be used to generate the interpolation points for a particular choice of $\sigma$.
We are now ready to state the main results of this section, which show that estimates of $g$ and $D$ can be obtained in $\mathcal{O}(n)$ computations when using the model (13) and a regular basis (33).
Theorem 3.
Proof.
Corollary 4.
Theorem 3 and Corollary 4 show that using a regular basis in the quadratic interpolation model leads to an estimate for $g$ that is similar to the central finite differences approximation in (29). The formula (29) uses the factor $1/(2h)$, while the formula (39) uses a slightly different factor. Both (29) and (39) involve the difference between the function values measured along the positive and negative basis directions. However, (39) also contains a ‘correction’ term, which involves the function values measured at all the vertices. Lastly, note that Theorem 3 requires knowledge of $f(x_0)$, but in the special case $\sigma = -1$, Corollary 4 shows that $f(x_0)$ is not used.
Before presenting the next theorem, the following parameters are defined. Let
(40) |
Remark 5.
Note that is monotonically increasing in . Moreover, , and when , , so for any .
Theorem 6.
Proof.
Corollary 7.
Theorem 6 and Corollary 7 show that using a regular basis in the quadratic interpolation model leads to an estimate for $D$ that is similar to the central finite differences approximation in (30). The formula (30) uses the factor $1/h^2$, whereas (45) uses a slightly different factor. However, (45) also contains a ‘correction’ term, which is a kind of weighted average of the function values measured at the interpolation points.
4 Interpolation using minimal positive bases
Spendley et al. [32] and Nelder and Mead [28] carried out some of the pioneering work on derivative-free simplex-based algorithms. This work continues today with studies on properties of simplices [1, 22] as well as work on simplex algorithm development [31]. These, and many other algorithms in derivative free optimization, employ positive bases and simplices, and involve geometric operations such as expansions, rotations and contractions. In such cases, the original simplex/positive basis involves $n+1$ (or possibly more) points (with their corresponding function values), and if a geometric operation is performed, then this often doubles the number of interpolation points and corresponding function values that are available. As previously mentioned, function values are expensive, so it is prudent to make full use of them.
This section is motivated, in part, by the ongoing research on simplex based methods, and the following types of ideas. Consider a simplex based algorithm, where one arrives at an iteration that involves an expansion step. In such a case, one would have access to the function values measured at the vertices of both the original and the expanded simplices as a by-product. If an approximation to the gradient or diagonal of the Hessian was readily available, then this might be useful curvature information that could be built into a simplex based algorithm to help locate the solution more quickly. Moreover, depending on the type of geometric operation involved, the vertices of the two simplices are generated from the same set of base directions, one set scaled by $h$, and one scaled by $\sigma h$ for some $\sigma$. Therefore, in this work we have not restricted to $\sigma = -1$, to ensure the results are applicable to expansions and contractions, and not just rotations.
In this section we investigate the use of minimal positive bases (and their associated simplices) to generate interpolation points that can be used in the model (13). We will show that the results from the previous section carry over, in a natural way, when the interpolation points are generated from $n+1$ (rather than $n$) directions.
4.1 Positive Bases and Simplices
Definition 8 (Positive Basis).
A set of vectors is called a positive basis for $\mathbb{R}^n$ if and only if the following two conditions hold.
(i) Every vector in $\mathbb{R}^n$ is a nonnegative linear combination of the members of the set.
(ii) No proper subset of the set satisfies (i).
Definition 9 (Minimal and Maximal Positive Bases).
A positive basis with $n+1$ elements is called a minimal positive basis. A positive basis with $2n$ elements is called a maximal positive basis.
Lemma 10 (Minimal Positive Basis).
Let the columns of a nonsingular matrix in $\mathbb{R}^{n \times n}$ be given. Then these columns, together with the negative of their sum, form a minimal positive basis for $\mathbb{R}^n$.
Lemma 10 shows that the set $\{e_1, \dots, e_n, -\mathbf{1}\}$ is a minimal positive basis for $\mathbb{R}^n$. Lemma 10 also shows that the columns of
(46) |
where the first $n$ columns are given by the matrix defined in (33), form a minimal positive basis for $\mathbb{R}^n$ (recall that this matrix is nonsingular by Lemma 1). It can be shown that (see also Lemma 3 in [13])
(47) |
The minimal positive basis has many useful properties, and it is shown in [13] that the angle between any pair of elements of the minimal positive basis is the same, while (47) shows that one of the directions is aligned with the vector $\mathbf{1}$. Henceforth, we refer to the columns of the matrix in (46) as a regular minimal positive basis.
Definition 11 (Affine independence).
A set of points is called affinely independent if the vectors formed by subtracting one (arbitrarily chosen) point of the set from each of the others are linearly independent.
Definition 12 (Simplex).
The convex hull of a set of affinely independent points is called a simplex; a simplex formed from $m+1$ affinely independent points has dimension $m$.
Definition 13 (Regular Simplex).
A regular simplex is a simplex that is also a regular polytope.
Given a point $x_0$ and the regular minimal positive basis, the two sets of points formed via
(48) |
are each affinely independent. Hence, the convex hull of the first set is a regular simplex aligned in the direction $\mathbf{1}$, and so too is the convex hull of the second set. This explains why we have adopted the terminology ‘regular’, and why we refer to the columns of the matrix in (33) as a regular basis, and the columns of the matrix in (46) as a regular minimal positive basis; both are related to a regular simplex (see Section 3.1 in [13]).
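Continuing with the assumed construction from the earlier sketch, the following check illustrates the two properties used here: appending a direction proportional to the all-ones vector gives $n+1$ unit directions with equal pairwise angles, and the corresponding interpolation points are the vertices of a regular simplex (all pairwise distances equal). The constants remain an assumption.

```python
import numpy as np

n = 3
a = np.sqrt((n + 1.0) / n)
b = -(np.sqrt(n + 1.0) - 1.0) / n**1.5
V = a * np.eye(n) + b * np.ones((n, n))                    # assumed regular basis
V_plus = np.hstack([V, -np.ones((n, 1)) / np.sqrt(n)])     # assumed regular minimal positive basis

print(np.round(V_plus.T @ V_plus, 6))   # unit diagonal, every off-diagonal entry equals -1/n

# The points x0 + h * v_j (j = 1, ..., n+1) are mutually equidistant:
x0, h = np.zeros(n), 0.1
X = x0[:, None] + h * V_plus
dists = [np.linalg.norm(X[:, i] - X[:, j])
         for i in range(n + 1) for j in range(i + 1, n + 1)]
print(np.round(dists, 12))              # all pairwise distances are equal
```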
4.2 Interpolation using a minimal positive basis
In this section we show that estimates of the gradient and diagonal Hessian can be obtained in $\mathcal{O}(n)$ computations when a regular minimal positive basis is used to generate the interpolation points. Hence, let the interpolation directions form a basis for $\mathbb{R}^n$, so that the associated direction matrix is nonsingular, and append the negative sum of these directions so that, by Lemma 10, the resulting $n+1$ directions form a minimal positive basis for $\mathbb{R}^n$. Let the function values be as in (15) and define
(49) |
Following the same arguments as those presented in Section 2, estimates to the gradient and diagonal of the Hessian can be found by solving
(50) |
where
(51) |
and
(52) |
Clearly, the matrices in (50) are not square, so $g$ and $D$ will be computed as least squares solutions to (50) in this section.
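A minimal sketch of this least-squares route: the same block elimination as in Section 2 is applied to the $n+1$ interpolation conditions from each of the two point sets, and the resulting rectangular systems are solved with `lstsq`. The names and the default choice of `sigma` are illustrative, and the closed-form $\mathcal{O}(n)$ solutions derived in the following subsections should be preferred in practice.

```python
import numpy as np

def mpb_g_and_D(f, x0, W, h, sigma=-1.0):
    """Least-squares estimates of g and diag(D) from a minimal positive basis:
    W has n + 1 columns, so each decoupled system has n + 1 equations in n
    unknowns.  Assumes sigma != 0, 1."""
    n = x0.size
    f0 = f(x0)
    dfy = np.array([f(x0 + h * W[:, j]) - f0 for j in range(n + 1)])
    dfz = np.array([f(x0 + sigma * h * W[:, j]) - f0 for j in range(n + 1)])
    A = h * W.T                        # (n+1) x n coefficient matrix for g
    B = 0.5 * h**2 * (W * W).T         # (n+1) x n coefficient matrix for diag(D)
    g, *_ = np.linalg.lstsq(A, (sigma**2 * dfy - dfz) / (sigma**2 - sigma), rcond=None)
    d, *_ = np.linalg.lstsq(B, (sigma * dfy - dfz) / (sigma - sigma**2), rcond=None)
    return g, np.diag(d)

# Example with the coordinate minimal positive basis {e_1, e_2, -1}:
f = lambda x: np.exp(x[0]) + x[0] * x[1] + x[1]**2
W = np.hstack([np.eye(2), -np.ones((2, 1))])
g, D = mpb_g_and_D(f, np.array([0.5, 1.0]), W, h=1e-3)
print(g)            # close to the true gradient (exp(0.5) + 1, 2.5)
print(np.diag(D))   # pure second derivatives shifted by a common constant (about 0.67 here),
                    # because the extra direction mixes in off-diagonal curvature
```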
4.3 Interpolation using the coordinate minimal positive basis
Here, take the interpolation directions to be the set $\{e_1, \dots, e_n, -\mathbf{1}\}$, which corresponds to the coordinate minimal positive basis. The following results hold.
Theorem 14.
Proof.
Note that
(53) |
and
(54) |
Combining the previous two shows that the least squares solution to (50) is
∎
Corollary 15.
Corollary 15 shows that when the coordinate minimal positive basis is used in the quadratic interpolation model, the expression for $g$ is similar to the central differences formula, except that an additional shift term, which involves the function values at all the interpolation points, is added to each element.
Theorem 16.
Proof.
Corollary 17.
4.4 Interpolation using a regular minimal positive basis
Here, take the interpolation directions to be the regular minimal positive basis, whose first $n$ elements are the columns of the matrix defined in (33). Furthermore, let the function values be as in (15) with
(56) |
We are now ready to present the main results of this section.
Theorem 18.
Proof.
Corollary 19.
Let the assumptions of Theorem 18 hold. Letting and letting
gives
(58) |
or equivalently, the $i$th component of $g$ is
(59) |
Before presenting the next theorem, define
(60) |
Theorem 20.
Proof.
Note that
(62) | |||||
Applying the Sherman-Morrison-Woodbury formula (28) to (62) gives
(63) |
Furthermore,
(64) | |||||
The least squares solution to (50) is
Finally, (61) depends only upon operations involving vectors (inner and scalar products and vector additions), so the computation is $\mathcal{O}(n)$. ∎
Corollary 21.
Let the assumptions of Theorem 20 hold. Setting and letting gives
4.5 Error bounds for the interpolation model
The following is a standard error bound; the proof is included for completeness.
Theorem 22.
Let the interpolation directions form a minimal positive basis for $\mathbb{R}^n$ and let $h > 0$. Suppose the interpolation points are constructed via
(65) |
Suppose that $f$ is twice continuously differentiable and that $\nabla^2 f$ is $L$-Lipschitz continuous. Then, the least squares solution $g$ to (50) satisfies
(66) |
Proof.
We begin by noticing that
(67) | |||||
By (6) and (65), the $i$th element of (67) is
(68) | |||||
The integral form of the Mean Value Theorem provides the identities
(69) | |||||
(70) |
because the points are constructed via (65). By [30, p.14], because $f$ is twice continuously differentiable, we have, for each interpolation point,
(71) |
Combining (69), (70) and (71), and subsequently subtracting, shows that for each $i$:
Hence, taking the 2-norm of (67), using the calculation above, and recalling the standard norm inequality for vectors in $\mathbb{R}^{n+1}$, gives (66). ∎
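The bound can be checked empirically: in the sketch below the gradient error for the least-squares estimate (here with a coordinate minimal positive basis and $\sigma = -1$; the test function and point are arbitrary choices) decreases by roughly a factor of 100 each time $h$ is reduced by a factor of 10, consistent with an $\mathcal{O}(h^2)$ bound.

```python
import numpy as np

f = lambda x: np.exp(x[0]) + np.sin(3.0 * x[1]) + x[0] * x[1]**2
grad_f = lambda x: np.array([np.exp(x[0]) + x[1]**2,
                             3.0 * np.cos(3.0 * x[1]) + 2.0 * x[0] * x[1]])
x0 = np.array([0.3, -0.7])
n, sigma = x0.size, -1.0
W = np.hstack([np.eye(n), -np.ones((n, 1))])      # coordinate minimal positive basis

for h in [1e-1, 1e-2, 1e-3]:
    dfy = np.array([f(x0 + h * W[:, j]) - f(x0) for j in range(n + 1)])
    dfz = np.array([f(x0 + sigma * h * W[:, j]) - f(x0) for j in range(n + 1)])
    rhs = (sigma**2 * dfy - dfz) / (sigma**2 - sigma)     # decoupled system for g
    g, *_ = np.linalg.lstsq(h * W.T, rhs, rcond=None)
    print(h, np.linalg.norm(g - grad_f(x0)))              # error shrinks roughly like h^2
```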
4.6 Summary for quadratic interpolation models
The following tables summarize our results for diagonal quadratic interpolation models. The formulae for $g$ are expressed in terms of the quantities defined in (51). The error bounds give the accuracy of the gradient approximation in terms of the sampling radius $h$.
Basis | Error | # Samples | Theorem
---|---|---|---
CB | | |
RB | | | Thm 3
CMPB | | | Thm 14
RMPB | | | Thm 18
Basis | # Samples | Theorem
---|---|---
CB | |
RB | | Thm 6
CMPB | | Thm 16
RMPB | | Thm 20
We conclude this section by remarking that, in the special case $n = 2$, if a minimal positive basis is used to generate the points then there is enough information to determine all entries of the Hessian approximation, because $2(n+1) = 6$ function values are available while a full quadratic model (with constant term fixed at $f(x_0)$) has only $n(n+3)/2 = 5$ remaining coefficients; i.e., one can approximate the full Hessian rather than only its diagonal. The algebra is straightforward, so it is omitted for brevity.
5 Linear Interpolation Models
While quadratic interpolation models lead to accurate approximations to the gradient and diagonal of the Hessian, they require at least $2n$ function evaluations (and more if the full Hessian is desired). For some applications, this cost is too high, and a linear interpolation model is preferred. In this case, one can obtain estimates of the gradient in $\mathcal{O}(n)$ computations, and this section details how. As usual, letting $g$ denote an approximation to the gradient $\nabla f(x_0)$, a linear model is
(72) $m(x) = f(x_0) + (x - x_0)^T g.$
Estimates of the gradient can be obtained easily from a linear model, although the approximations from such a model (using $n$ or $n+1$ function evaluations, depending on the interpolation points) are only $\mathcal{O}(h)$ accurate (in the previous cases, which used roughly twice as many function evaluations, the approximations are $\mathcal{O}(h^2)$ accurate).
The interpolation conditions simply require the linear model to match the function values at the interpolation points. If there are $n$ interpolation points (formed from a unit basis for $\mathbb{R}^n$), then $g$ is the solution to
(73) |
while if there are $n+1$ interpolation points (i.e., a minimal positive basis for $\mathbb{R}^n$), then $g$ is the (least squares) solution to
(74) |
An explicit solution to (73) is known for a coordinate basis, and solutions to (74) are known for the coordinate minimal positive basis and a regular minimal positive basis.
An explicit solution to (73) for a regular basis (i.e., the matrix in (33)) is as follows. By (73), the gradient estimate solves a linear system whose coefficient matrix is the matrix in (33). Following the same arguments as in the proof of Theorem 3 then gives an explicit componentwise formula for $g$, which can be evaluated in $\mathcal{O}(n)$ operations.
The following theorem provides the standard bound for linear interpolation models.
Theorem 23.
Let the interpolation directions form a minimal positive basis for $\mathbb{R}^n$ and let $h > 0$. Suppose the interpolation points are constructed by adding $h$ times each direction to $x_0$. Suppose also that $f$ is continuously differentiable with an $L$-Lipschitz continuous gradient. Then, the least squares solution to (74) satisfies
(75) |
It is clear that if only $n$ points are used to generate the interpolation model, then an analogous bound holds, with a modified constant.
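The following sketch computes the least-squares ‘simplex gradient’ of a linear model from $n+1$ points; it uses the coordinate minimal positive basis purely as an example, and the estimate is $\mathcal{O}(h)$ accurate, in line with Theorem 23.

```python
import numpy as np

def simplex_gradient(f, x0, W, h):
    """Least-squares gradient of the linear model built from the points
    x0 + h * w_j, where the w_j are the columns of W (n or n+1 of them)."""
    df = np.array([f(x0 + h * W[:, j]) - f(x0) for j in range(W.shape[1])])
    g, *_ = np.linalg.lstsq(h * W.T, df, rcond=None)
    return g

f = lambda x: np.sum(x**2) + x[0] * x[1]
x0 = np.array([1.0, 2.0, 3.0])
W = np.hstack([np.eye(3), -np.ones((3, 1))])   # coordinate minimal positive basis
print(simplex_gradient(f, x0, W, h=1e-4))      # close to the true gradient (4, 5, 6)
```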
The results for linear interpolation models are summarized in Table 3; in each case the gradient approximation is $\mathcal{O}(h)$ accurate in the sampling radius $h$.
6 Numerical Experiments
Here we present numerical examples to verify the results of this paper, and to provide a concrete application to show that the results have practical use. All experiments are performed using MATLAB (version 2018a).
6.1 Derivative Computations
Here we depart from our usual notation and let with components so that Rosenbrock’s function can be written as
(76) |
The gradient of (76) is
(77) |
and the Hessian of (76) is
(78) |
Henceforth, we return to our usual notation.
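For readers who wish to reproduce this style of experiment, the following sketch encodes Rosenbrock's function together with its analytic gradient and the diagonal of its Hessian, and compares them with central-difference estimates; the evaluation point and sampling radius below are illustrative and are not the test points (79) and (82) used in this paper.

```python
import numpy as np

def rosenbrock(x):
    """Rosenbrock's function f(x) = 100 (x2 - x1^2)^2 + (1 - x1)^2."""
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosenbrock_grad(x):
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0]**2)])

def rosenbrock_hess_diag(x):
    return np.array([1200.0 * x[0]**2 - 400.0 * x[1] + 2.0, 200.0])

x0, h = np.array([-1.0, 1.0]), 1e-4            # illustrative point and sampling radius
e = np.eye(2)
g_cd = np.array([(rosenbrock(x0 + h * e[:, i]) - rosenbrock(x0 - h * e[:, i])) / (2.0 * h)
                 for i in range(2)])
d_cd = np.array([(rosenbrock(x0 + h * e[:, i]) - 2.0 * rosenbrock(x0)
                  + rosenbrock(x0 - h * e[:, i])) / h**2 for i in range(2)])
print(np.linalg.norm(g_cd - rosenbrock_grad(x0)))       # O(h^2) gradient error
print(np.linalg.norm(d_cd - rosenbrock_hess_diag(x0)))  # small error in the pure second derivatives
```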
For the first numerical experiment we simply use the results presented in this work to compute approximations to the gradient and diagonal of the Hessian at two different points, using Rosenbrock’s function. The points are chosen to be those where an accurate gradient is needed to make algorithmic progress on Rosenbrock’s function.
Point near valley floor
The first point is
(79) |
which was chosen because it is close to the “valley floor” for Rosenbrock’s function where a good approximation to the gradient is required to make algorithmic progress.
The aligned regular simplex is constructed using the approach previously presented. Here, so that and by (31). Recalling that (33) gives (to 4d.p.)
(80) |
For this experiment was used. For the regular basis (RB), the points were generated as , for , while the regular minimal positive basis (RMPB) used , for . For the coordinate basis (CB), the points were generated as , for , while the coordinate minimal positive basis used the previous points as well as and .
Substituting (79) into (77) and (78) shows that the analytic gradient and Hessian are
(81) |
Approximations to the gradient and Hessian were built using the theorems presented earlier in this work, and are stated in Table 4. Also stated in the table is
That is, the reported gradient error is the computed error between the true gradient reported in (81) and the computed approximation. Similarly, the reported Hessian error is the computed error between the true Hessian diagonal reported in (81) and the computed approximation. Note that, for the given point (79), the theoretical error bound for the gradient approximations is proportional to $L h^2$, with the Lipschitz constant $L$ estimated numerically.
CB | ||||
---|---|---|---|---|
RB | ||||
CMPB | ||||
RMPB |
Table 4 shows that the computed gradient approximations are all close to the true solution (81). The error in the gradient for each approximation is below the theoretical bound, as expected. For a coordinate basis and a regular minimal positive basis, the approximation to the diagonal of the Hessian is also good, both having small errors. However, for a regular basis and a coordinate minimal positive basis, the approximations are poor, both having large errors. Upon further inspection, one notices that the Hessian approximation for a regular basis is essentially just a shifted version of the approximation for a regular minimal positive basis (subtracting a fixed amount from each component of the regular basis approximation gives the approximation for a regular minimal positive basis). A similar comment holds for the approximations for a coordinate basis and for a coordinate minimal positive basis. This is consistent with the theorems presented in this work. For example, comparing Theorems 6 and 20, both estimates consist of a common vector with each component shifted by a fixed amount, the fixed amount differing between the two theorems.
Point near the solution
Here, the previous experiment is repeated at a point near the solution $(1, 1)^T$ of Rosenbrock's function. Let
(82) |
When close to the solution, it is appropriate to choose the sampling radius $h$ to be small, so a smaller value of $h$ was used here. Substituting (82) into (77) and (78) gives
(83) |
For this experiment with the point (82), the theoretical error bound for the gradient approximations is again proportional to $L h^2$, with the Lipschitz constant $L$ estimated numerically.
CB | ||||
---|---|---|---|---|
RB | ||||
CMPB | ||||
RMPB |
Table 5 again shows that the computed gradient approximations are all close to the true solution (83), with the errors in the gradient approximations all below the theoretical bound. For a coordinate basis and a regular minimal positive basis, the approximation to the diagonal of the Hessian is also good, both having small errors. However, for a regular basis and a coordinate minimal positive basis, the approximations are again poor, both having large errors. Again, the solution for a coordinate minimal positive basis is a shifted version of the approximation for a coordinate basis, and similarly for the regular basis and regular minimal positive basis.
6.2 Derivative estimates within a frame based preconditioned conjugate gradients algorithm
The purpose of this section is to give an example of how the gradient estimates presented in this work could be used within an existing derivative free optimization algorithm. The algorithm employed here is the frame based preconditioned conjugate gradients method from [12] (henceforth referred to as FB-PCG), which can be applied to problems of the form (1) where $f$ is a smooth function whose derivatives are unavailable. The algorithm uses a Conjugate Gradients (CG) type inner loop, so approximations to the gradient are needed. It is well known that the practical behaviour of CG often improves when an appropriate preconditioner is used. Thus, the inexpensive estimates of the gradient and of the diagonal of the Hessian are of particular use here; the diagonal Hessian estimate developed in this work is an inexpensive approximation to the Hessian and, being diagonal, is a convenient preconditioner. The algorithm in [12] employs frames as a globalisation strategy to ensure convergence.
Here, the algorithm is implemented exactly the same way as in [12]. The only difference is the way in which the gradient and pure second derivatives are computed. In [12], these quantities are estimated using finite differences. Here we set $\sigma = -1$, and run the algorithm using the four derivative variants presented previously in this work: (i) a Coordinate Basis (CB), which is equivalent to finite difference approximations because $\sigma = -1$, (ii) a Regular Basis (RB), (iii) a Coordinate Minimal Positive Basis (CMPB), and (iv) a Regular Minimal Positive Basis (RMPB). It is expected that the practical behaviour of the algorithm will be similar in each instance, because all gradient approximations used are $\mathcal{O}(h^2)$ accurate. However, we comment that for a coordinate minimal positive basis, the approximation to the pure second derivatives was computed using finite differences, to ensure the best possible algorithmic performance.
(84) |
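To illustrate the role played by the two estimates without reproducing FB-PCG itself, here is a heavily simplified derivative-free loop: the gradient estimate drives preconditioned nonlinear conjugate-gradient directions, and the diagonal Hessian estimate serves as the preconditioner. This is a sketch only; it omits the frames, resets, step-size reduction rules, and convergence safeguards of the algorithm in [12], and all parameter choices are arbitrary.

```python
import numpy as np

def fd_grad_and_diag(f, x, h):
    """Central-difference gradient and diagonal Hessian estimates (the CB variant)."""
    n = x.size
    f0 = f(x)
    g, d = np.empty(n), np.empty(n)
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        fp, fm = f(x + h * e), f(x - h * e)
        g[i] = (fp - fm) / (2.0 * h)
        d[i] = (fp - 2.0 * f0 + fm) / h**2
    return g, d

def preconditioned_cg_sketch(f, x, h=1e-4, iters=50):
    """NOT the FB-PCG algorithm of [12]: a bare-bones loop showing how the gradient
    estimate g and the safeguarded diagonal |d| can supply the search direction
    and the preconditioner."""
    g, d = fd_grad_and_diag(f, x, h)
    M = np.maximum(np.abs(d), 1e-8)          # safeguarded diagonal preconditioner
    p = -g / M
    for _ in range(iters):
        t, f0 = 1.0, f(x)                    # simple backtracking (Armijo) line search
        while f(x + t * p) > f0 + 1e-4 * t * np.dot(g, p) and t > 1e-12:
            t *= 0.5
        x = x + t * p
        g_new, d = fd_grad_and_diag(f, x, h)
        M = np.maximum(np.abs(d), 1e-8)
        beta = max(0.0, np.dot(g_new / M, g_new - g) / max(np.dot(g / M, g), 1e-16))
        p = -g_new / M + beta * p            # preconditioned PR+-style CG direction
        g = g_new
    return x

rosen = lambda x: 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2
print(preconditioned_cg_sketch(rosen, np.array([-1.2, 1.0])))   # Rosenbrock's minimizer is (1, 1)
```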
The algorithm is guaranteed to converge, as confirmed by the following theorem.
Theorem 24 (Theorem 1 in [12]).
If the following conditions are satisfied: (1) the sequence of iterates is bounded; (2) $f$ is continuously differentiable with a locally Lipschitz gradient; and (3) the frame size parameter tends to zero as the iteration count tends to infinity, then the subsequence of quasi-minimal frame centers is infinite, and all cluster points of this subsequence are stationary points of $f$.
Remark 25.
The algorithm of [23] is based upon the FB-PCG algorithm in [12], but there are several differences. In [23], a single minimal positive basis is employed at each iteration (i.e., a linear model is used, and approximations to the second derivatives are unavailable). Also, for the algorithm in [23], at each iteration, the new point is always set to be the point with the best (lowest) function value encountered. On the other hand, Algorithm 1 from [12] performs a fixed number of PCG iterations, to maintain the positive properties of the PCG algorithm. After this fixed number of iterations there is a reset, where the PCG process begins again from the point with the lowest function value over the preceding iterations. Because [23] employs only approximations to the gradient and does not use approximations to the second derivatives, we do not consider it here.
The problems tested are taken from the MGH test set in [27] and were chosen to allow comparison with the results in [12] and [23]. The problems are stated in Table 6, and the results of applying the FB-PCG algorithms to the test problems are given in Tables 7 and 8. As suggested in Section 2.2 of [26], each algorithm was terminated if it reached the maximum budget of 1300 function evaluations.
No. | Function Name | $n$ | No. | Function Name | $n$
---|---|---|---|---|---
1 | Rosenbrock | 2 | 12 | Box 3-D | 3 |
2 | Freudenstein and Roth | 2 | 13 | Powell singular | 4 |
3 | Powell badly scaled | 2 | 14 | Woods function | 4 |
4 | Brown badly scaled | 2 | 15 | Kowalski and Osborne | 4 |
5 | Beale | 2 | 16 | Brown and Dennis | 4 |
6 | Jennrich and Sampson | 2 | 17 | Osborne 1 | 5 |
7 | Helical valley function | 3 | 18 | Bigg’s exponential | 6 |
8 | Bard | 3 | 19 | Osborne 2 | 11 |
9 | Gaussian function | 3 | 20 | Trigonometric | 2 |
10 | Meyer function | 3 | 21 | Brown almost linear | 2 |
11 | Gulf research & development | 3 | 22 | Broyden tridiagonal | 20 |
Prob. | grad type | nf | fmin | ng | h | itns | qmfs
---|---|---|---|---|---|---|---|
1 | CB | 300 | 5.2336e-11 | 7.6674e-06 | 5.6843e-07 | 34 | 15 |
RB | 1300 | 5.3137e+00 | 4.7256e+00 | 1.5625e-02 | 149 | 3 | |
CMPB | 345 | 2.8696e-13 | 1.3357e-06 | 3.5527e-06 | 31 | 15 | |
RMPB | 381 | 5.5919e-11 | 8.3890e-06 | 2.2737e-06 | 34 | 14 | |
2 | CB | 117 | 4.8984e+01 | 8.9198e-05 | 9.5367e-06 | 14 | 9 |
RB | 102 | 4.8984e+01 | 2.6257e-06 | 3.8147e-06 | 12 | 9 | |
CMPB | 245 | 5.9689e-18 | 9.3936e-08 | 5.9605e-07 | 23 | 11 | |
RMPB | 165 | 4.8984e+01 | 8.6418e-06 | 5.9605e-06 | 16 | 10 | |
3 | CB | 1300 | 1.0178e-08 | 4.8733e+00 | 6.2500e-10 | 86 | 44 |
RB | 1300 | 1.1101e-08 | 2.2184e+00 | 1.5558e-09 | 102 | 49 | |
CMPB | 208 | 1.0101e-08 | 2.0101e-04 | 1.0000e-10 | 18 | 18 | |
RMPB | 1300 | 1.5221e-07 | 1.7535e+01 | 1.7937e-08 | 71 | 36 | |
4 | CB | 188 | 2.8703e-24 | 3.3884e-06 | 9.0949e-08 | 20 | 15 |
RB | 191 | 4.1550e-26 | 4.0788e-07 | 1.0000e-10 | 21 | 19 | |
CMPB | 1300 | 1.2157e+09 | 6.9664e+10 | 9.7656e+00 | 119 | 1 | |
RMPB | 274 | 7.5858e-24 | 5.5090e-06 | 2.2737e-10 | 21 | 18 | |
5 | CB | 96 | 1.7741e-12 | 9.5287e-06 | 9.5367e-06 | 12 | 9 |
RB | 131 | 7.6731e-14 | 7.3698e-07 | 1.4901e-06 | 16 | 11 | |
CMPB | 137 | 1.9443e-13 | 7.2973e-07 | 9.5367e-07 | 14 | 10 | |
RMPB | 135 | 1.3844e-12 | 1.0291e-06 | 9.5367e-06 | 13 | 9 | |
6 | CB | 214 | 1.2436e+02 | 1.1450e-06 | 9.3132e-08 | 26 | 13 |
RB | 179 | 1.2436e+02 | 6.9703e-04 | 2.3842e-06 | 21 | 10 | |
CMPB | 242 | 1.2436e+02 | 3.9619e-06 | 1.4901e-07 | 23 | 12 | |
RMPB | 198 | 1.2436e+02 | 4.9373e-05 | 5.9605e-08 | 20 | 12 | |
7 | CB | 277 | 2.4478e-16 | 4.0593e-08 | 2.2737e-08 | 27 | 16 |
RB | 358 | 1.3523e-12 | 5.2517e-06 | 3.6380e-06 | 33 | 13 | |
CMPB | 386 | 1.1919e-17 | 1.2976e-07 | 2.2204e-08 | 31 | 18 | |
RMPB | 404 | 1.3124e-13 | 6.2378e-07 | 5.6843e-08 | 32 | 16 | |
8 | CB | 228 | 8.2149e-03 | 1.5169e-06 | 3.6380e-08 | 21 | 15 |
RB | 226 | 8.2149e-03 | 1.2996e-06 | 1.4552e-07 | 21 | 14 | |
CMPB | 332 | 8.2149e-03 | 3.2610e-06 | 9.3132e-08 | 27 | 13 | |
RMPB | 325 | 8.2149e-03 | 5.2576e-06 | 5.6843e-08 | 25 | 16 | |
9 | CB | 88 | 1.1279e-08 | 3.0817e-08 | 3.8147e-06 | 9 | 9 |
RB | 88 | 1.1279e-08 | 2.6819e-08 | 3.8147e-06 | 9 | 9 | |
CMPB | 106 | 1.1279e-08 | 2.5277e-08 | 3.8147e-06 | 9 | 9 | |
RMPB | 106 | 1.1279e-08 | 2.2890e-07 | 3.8147e-06 | 9 | 9 | |
10 | CB | 1300 | 2.6761e+04 | 1.4694e+06 | 1.7937e-03 | 95 | 31 |
RB | 1300 | 4.7088e+06 | 1.4611e+08 | 8.6736e-05 | 114 | 16 | |
CMPB | 1300 | 8.7574e+06 | 3.3756e+08 | 5.5511e-02 | 99 | 12 | |
RMPB | 1300 | 6.6179e+06 | 2.0357e+08 | 8.6736e-04 | 95 | 15 | |
11 | CB | 611 | 2.1173e-12 | 5.3498e-06 | 1.2622e-08 | 50 | 27 |
RB | 586 | 8.6109e-10 | 9.3682e-06 | 8.0779e-09 | 48 | 26 | |
CMPB | 653 | 5.8997e-10 | 7.3782e-06 | 8.2718e-09 | 46 | 24 | |
RMPB | 1300 | 3.7178e-04 | 1.1058e-02 | 7.1746e-05 | 97 | 32 |
Prob. | op | nf | fmin | ng | h | itn | qmf |
---|---|---|---|---|---|---|---|
12 | CB | 258 | 1.0126e-06 | 8.9612e-06 | 5.6843e-10 | 22 | 18 |
RB | 132 | 7.8257e-17 | 1.1635e-08 | 5.9605e-06 | 12 | 10 | |
CMPB | 336 | 3.7634e-08 | 5.5972e-06 | 3.5527e-09 | 24 | 18 | |
RMPB | 170 | 8.7957e-13 | 9.6558e-08 | 1.4901e-06 | 13 | 11 | |
13 | CB | 388 | 9.5087e-09 | 7.8340e-06 | 8.8818e-10 | 29 | 19 |
RB | 680 | 1.2044e-11 | 3.4770e-06 | 8.2718e-10 | 53 | 25 | |
CMPB | 539 | 9.7590e-10 | 2.7194e-06 | 1.3878e-09 | 36 | 20 | |
RMPB | 732 | 1.2329e-09 | 2.2736e-06 | 5.2940e-09 | 49 | 23 | |
14 | CB | 496 | 2.2343e-13 | 3.8564e-06 | 1.3878e-08 | 39 | 19 |
RB | 664 | 1.0394e-12 | 9.6018e-06 | 1.3553e-08 | 52 | 21 | |
CMPB | 1175 | 7.0443e-21 | 3.2592e-09 | 3.1554e-10 | 79 | 29 | |
RMPB | 1200 | 2.6560e-16 | 4.1629e-07 | 3.0815e-09 | 80 | 30 | |
15 | CB | 409 | 3.0751e-04 | 7.0794e-08 | 3.4694e-10 | 29 | 21 |
RB | 730 | 3.0751e-04 | 8.1424e-07 | 3.0815e-09 | 54 | 30 | |
CMPB | 456 | 3.0751e-04 | 2.1624e-07 | 1.0000e-10 | 29 | 22 | |
RMPB | 879 | 3.0751e-04 | 2.9915e-06 | 1.9259e-10 | 56 | 32 | |
16 | CB | 218 | 8.5822e+04 | 2.0390e-01 | 2.3842e-05 | 18 | 9 |
RB | 256 | 8.5822e+04 | 5.2632e-02 | 5.9605e-06 | 21 | 10 | |
CMPB | 328 | 8.5822e+04 | 8.3039e-03 | 9.5367e-06 | 23 | 9 | |
RMPB | 269 | 8.5822e+04 | 1.3964e-02 | 5.9605e-06 | 19 | 10 | |
17 | CB | 1300 | 6.1863e-05 | 7.8087e-02 | 7.5232e-06 | 86 | 29 |
RB | 1300 | 9.2307e-02 | 6.5300e+01 | 1.5259e-03 | 86 | 6 | |
CMPB | 1300 | 3.3394e-03 | 6.3918e+00 | 9.5367e-04 | 77 | 7 | |
RMPB | 1300 | 5.3120e-02 | 1.9312e+01 | 9.5367e-04 | 78 | 7 | |
18 | CB | 522 | 5.6556e-03 | 5.2532e-06 | 8.6736e-09 | 30 | 20 |
RB | 1300 | 5.6557e-03 | 1.8064e-04 | 1.1210e-08 | 73 | 37 | |
CMPB | 712 | 5.6557e-03 | 3.1441e-06 | 2.1176e-09 | 36 | 23 | |
RMPB | 1300 | 5.6559e-03 | 8.2987e-04 | 7.3468e-08 | 65 | 33 | |
19 | CB | 1300 | 4.0437e-02 | 4.3713e-02 | 1.3235e-03 | 48 | 18 |
RB | 1300 | 2.8701e-01 | 2.5030e+00 | 6.2500e-02 | 49 | 2 | |
CMPB | 1300 | 4.6094e-02 | 2.3523e-01 | 5.2940e-03 | 45 | 17 | |
RMPB | 1300 | 2.8656e-01 | 3.4951e+00 | 6.2500e-02 | 46 | 2 | |
20 | CB | 80 | 9.2331e-18 | 1.3956e-08 | 3.8147e-06 | 10 | 9 |
RB | 78 | 1.0496e-17 | 1.4880e-08 | 3.8147e-06 | 10 | 9 | |
CMPB | 97 | 1.9978e-17 | 2.0529e-08 | 3.8147e-06 | 10 | 9 | |
RMPB | 99 | 8.0511e-18 | 1.3032e-08 | 3.8147e-06 | 10 | 9 | |
21 | CB | 77 | 2.3695e-19 | 1.8024e-09 | 3.8147e-06 | 10 | 9 |
RB | 77 | 1.5376e-16 | 6.2567e-08 | 3.8147e-06 | 10 | 9 | |
CMPB | 98 | 6.8671e-14 | 1.3603e-06 | 3.8147e-06 | 10 | 9 | |
RMPB | 96 | 7.0015e-15 | 4.3732e-07 | 3.8147e-06 | 10 | 9 | |
22 | CB | 1155 | 7.4976e-13 | 7.8573e-06 | 1.0000e-10 | 26 | 19 |
RB | 1244 | 7.6929e-13 | 8.5455e-06 | 1.0000e-10 | 28 | 18 | |
CMPB | 1207 | 2.9752e-13 | 5.3954e-06 | 1.0000e-10 | 26 | 19 | |
RMPB | 1113 | 4.3271e-13 | 8.0808e-06 | 1.0000e-10 | 24 | 18 |
The experiments show that the Frame Based Preconditioned Conjugate Gradients algorithm performs similarly regardless of which gradient and diagonal Hessian approximation is used, as expected.
7 Conclusion
This work studied the use of interpolation models to obtain approximations to the gradient and diagonal of the Hessian of a function $f$. We showed that if the coordinate basis is used in the interpolation model, with the parameter $\sigma = -1$ used to generate the interpolation points, then one recovers the standard finite difference approximations to the derivatives. We also showed that if a regular basis, a coordinate minimal positive basis, or a regular minimal positive basis is used within the interpolation model, then approximations to the gradient and to the pure second derivatives are available in $\mathcal{O}(n)$ flops. An error bound was presented which shows that, for each set of interpolation directions considered in this work, the gradient approximation is $\mathcal{O}(h^2)$ accurate. Numerical experiments were presented that show the practical uses of the estimates developed. In particular, we showed how they could be employed within the FB-PCG algorithm to solve (1).
References
- [1] Charles Audet and Warren Hare. Derivative-free and blackbox optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2017.
- [2] A. S. Bandeira, K. Scheinberg, and L. N. Vicente. Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization. Mathematical Programming Series B, 134:223–257, 2012.
- [3] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018. Editor: Léon Bottou.
- [4] A. S. Berahas, L. Cao, K. Choromanskiy, and K. Scheinberg. A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Technical report, Department of Industrial and Systems Engineering, Lehigh University,, Bethlehem, PA, USA, 2019. arXiv:1905.01332v2 [math.OC].
- [5] G. Cocchi, G. Liuzzi, A. Papini, and M. Sciandrone. An implicit filtering algorithm for derivative-free multiobjective optimization with box constraints. Computational Optimization and Applications, 69(2):267–296, 2018.
- [6] A. R. Conn, K. Scheinberg, and Luís N. Vicente. Geometry of interpolation sets in derivative free optimization. Mathematical Programming Series B, 111:141–172, 2008.
- [7] A. R. Conn, K. Scheinberg, and Luís N. Vicente. Geometry of sample sets in derivative-free optimization: polynomial regression and underdetermined interpolation. IMA Journal of Numerical Analysis, 28:721–748, 2008.
- [8] Andrew R. Conn and Philippe L. Toint. An algorithm using quadratic interpolation for unconstrained derivative free optimization. In G. Di Pillo and F. Giannessi, editors, Nonlinear Optimization and Applications, pages 27–47, Boston, MA, 1996. Springer US.
- [9] A.R. Conn, K.Scheinberg, and L.N. Vicente. Introduction to Derivative-Free Optimization. MPS-SIAM Series on Optimization, Philadelphia, 2009.
- [10] A.R. Conn, K.Scheinberg, and L.N. Vicente. Introduction to derivative-free optimization. MPS-SIAM Series on Optimization, Philadelphia, 2009.
- [11] A.R. Conn, K. Scheinberg, and P.L. Toint. Recent progress in unconstrained nonlinear optimization without derivatives. Mathematical Programming, 79:397–414, 1997.
- [12] I. D. Coope and C. J. Price. A derivative-free frame-based conjugate gradients method. Ucdms2002-7, University of Canterbury, 2002. https://ir.canterbury.ac.nz/handle/10092/11758.
- [13] I. D. Coope and R. Tappenden. Efficient calculation of regular simplex gradients. Computational Optimization and Applications, 72(3):561–588, April 2019.
- [14] I.D. Coope and C.J. Price. Frame-based methods for unconstrained optimization. Journal of Optimization Theory & Applications, 107(2):261–274, 2000.
- [15] I.D. Coope and C.J. Price. Positive bases in numerical optimization. Computational Optimization & Applications, 21:169–175, 2002.
- [16] Giovanni Fasano, José Luis Morales, and Jorge Nocedal. On the geometry phase in model-based algorithms for derivative-free optimization. Optimization Methods and Software, 24(1):145–154, 2009.
- [17] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. Proceedings of Machine Learning Research (PMLR), 80:1467–1476, 2018. International Conference on Machine Learning, 10-15 July 2018, Stockholmsmässan, Stockholm, Sweden.
- [18] P. Gilmore, C. T. Kelley, C. T. Miller, and G. A. Williams. Implicit filtering and optimal design problems. In Jeffrey Borggaard, John Burkardt, Max Gunzburger, and Janet Peterson, editors, Optimal Design and Control, pages 159–176, Boston, MA, 1995. Birkhäuser Boston.
- [19] P. Gilmore and C.T. Kelley. An implicit filtering algorithm for optimization of functions with many local minima. SIAM J. Optim., 5(2):269–285, 1995.
- [20] W. Hare and M. Jaberipour. Adaptive interpolation strategies in derivative-free optimization: a case study. Technical report, University of British Colombia, Canada, and Amirkabir University of Technology, Iran, November 2015. arXiv:1511.02794v1 [math.OC].
- [21] Philipp H. W. Hoffmann. A hitchhiker’s guide to automatic differentiation. Numerical Algorithms, 72(3):775–811, 2016.
- [22] Gabriel Jarry-Bolduc, Patrick Nadeau, and Shambhavi Singh. Uniform simplex of an arbitrary orientation. Optimization Letters, 2019. Published online 03 July 2019, https://doi.org/10.1007/s11590-019-01448-3.
- [23] Q. Liu. Two minimal positive basis based direct search conjugate gradient methods for computationally expensive functions. Numer Algor, 58:461–474, 2011.
- [24] Alvaro Maggiar, Andreas Wächter, Irina S Dolinskaya, and Jeremy Staum. A derivative-free trust-region algorithm for the optimization of functions smoothed via gaussian convolution using adaptive multiple importance sampling. SIAM Journal on Optimization, 28(2):1478–1507, 2018.
- [25] C. C. Margossian. A review of automatic differentiation and its efficient implementation. Technical report, Department of Statistics, Columbia University, January 2019. arXiv:1811.05031v2 [cs.MS].
- [26] J. J. Moré and S. Wild. Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization, 20(1):172–191, 2009.
- [27] J.J. More, B.S. Garbow, and K.E. Hillstrom. Testing unconstrained optimization software. ACM Trans. Math. Softw., 7(1):17–41, 1981.
- [28] J.A. Nelder and R. Mead. A simplex method for function minimization. The computer journal, 7(4):308–313, 1965.
- [29] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
- [30] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 2 edition, 2006.
- [31] Luis Miguel Rios and Nikolaos V. Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations. J Glob Optim, 56:1247–1293, 2013.
- [32] W. Spendley, G.R. Hext, and F.R. Himsworth. Sequential application of simplex designs in optimisation and evolutionary operation. Technometrics, 4:441–461, 1962.
- [33] Stefan M. Wild and Christine Shoemaker. Global convergence of radial basis function trust-region algorithms for derivative-free optimization. SIAM Review, 55(2):349–371, 2013.
Appendix A Constants
In this section we state several equivalences between constants that are used in this work.
(85) |
(86) |
(87) | |||||
(88) |
(89) |
(90) |
(91) |
(92) |
(93) |
(94) |