
Model Construction for Convex-Constrained Derivative-Free Optimization

Lindon Roberts School of Mathematics and Statistics, University of Sydney, Camperdown NSW 2006, Australia ([email protected]). This work was supported by the Australian Research Council Discovery Early Career Award DE240100006.
Abstract

We develop a new approximation theory for linear and quadratic interpolation models, suitable for use in convex-constrained derivative-free optimization (DFO). Most existing model-based DFO methods for constrained problems assume the ability to construct sufficiently accurate approximations via interpolation, but the standard notions of accuracy (designed for unconstrained problems) may not be achievable by only sampling feasible points, and so may not give practical algorithms. This work extends the theory of convex-constrained linear interpolation developed in [Hough & Roberts, SIAM J. Optim, 32:4 (2022), pp. 2552–2579] to the case of linear regression models and underdetermined quadratic interpolation models.

Keywords: derivative-free optimization, convex constraints, interpolation

Mathematics Subject Classification: 65K05, 90C30, 90C56

1 Introduction

This work is concerned with model-based methods for derivative-free optimization (DFO). Such methods are widely studied and have proven to be popular in practice [16, 1, 25]. Most commonly, model-based methods construct a local interpolant for the objective from selected evaluations, which is then used in place of a Taylor series within a standard framework, typically trust-region methods [13]. The convergence and worst-case complexity of model-based DFO in the unconstrained setting is well-established. Here, we are interested in nonconvex minimization over a convex set,

\min_{\bm{x}\in C}f(\bm{x}), (1.1)

where f:\mathbb{R}^{n}\to\mathbb{R} is smooth (but nonconvex), and C\subseteq\mathbb{R}^{n} is a closed, convex set with nonempty interior. Such problems appear in the derivative-free setting, for example with bound and linear constraints arising in problems from climate science [31, 30], deep learning [33] and systems design engineering [23]. Although a model-based DFO framework for solving (1.1) was recently developed in [22], it only considered the case of constructing linear interpolants from function evaluations (at feasible points). Outside of structured problems such as nonlinear least-squares [8], quadratic interpolants are the most common models used in model-based DFO [28, 29, 16]. In this work, we develop a comprehensive approximation theory for linear regression and underdetermined quadratic interpolation models based on sampling only from feasible points. This theory fits the requirements of the method from [22], and so allows this framework to be used with more practical model choices.

1.1 Existing work

Model-based DFO methods for unconstrained problems have well-established convergence [13, 16] and worst-case complexity theory [17, 25]. This theory relies on building (typically linear or quadratic) interpolants for the objective function based on samples of the objective at specific points, which must be “fully linear”, i.e. satisfying specific error bounds. Unlike much approximation theory in numerical analysis (e.g. [32]), it is typically assumed that we have limited ability to select the sample points and instead should use as many pre-existing evaluations as possible (motivated by problems with computationally expensive objectives, an important DFO use case). In this setting, the model accuracy can be bounded by quantities depending only on the choice of sampling points (and information about the smoothness of the objective). Originally in [13], the concept of “\Lambda-poised” interpolation sets was introduced as a sufficient condition for constructing fully linear interpolation models, together with procedures for improving the choice of sample points (while retaining as many existing points as possible) to construct a \Lambda-poised set. This was extended in [14] to polynomial regression and underdetermined polynomial interpolation models. A specific form of underdetermined quadratic interpolation, namely minimum Frobenius interpolation, was introduced and extensively studied in [26, 27] and is the foundation of several state-of-the-art codes [28, 29, 6]. An analysis of minimum Frobenius interpolation in terms of \Lambda-poisedness/fully linear error bounds was given in [16]. Alternative model constructions have been proposed based on weighted linear regression [5], underdetermined quadratic interpolation based on Sobolev norms [37], radial basis functions (RBFs) [35] and combinations of polynomials and RBFs [2], for example.

Many model-based DFO methods have been proposed for constrained optimization; see [25, Section 7] for a thorough survey. We note in particular the following strictly feasible methods: [10] studies convex-constrained problems, [29, 19] and [34, Section 6.3] consider bound constraints, and [20] considers linear inequality constraints. Strictly feasible methods for nonconvex-constrained problems have also been considered in [11, 9, 36]. In those cases where convergence was established, it was based on the assumption that fully linear models could be constructed using the same \Lambda-poised interpolation sets as in the unconstrained case. However, this is potentially impractical, as in some settings “it may be impossible to obtain a fully linear model using only feasible points” [25, p. 362].

More recently, [22] introduces a model-based DFO method for solving (1.1) by adapting the unconstrained approach from [15], based on gradient-based trust-region methods for convex-constrained optimization [12, Chapter 12]. The method has convergence and worst-case complexity results which align with unconstrained model-based DFO [17] and convex-constrained gradient-based trust-region methods [7]. To make this approach practical, it introduced a weaker notion of fully linear models which are sufficient for algorithm convergence, but which can be attained using linear interpolation over a \Lambda-poised set of only feasible points. It also demonstrated how to modify an interpolation set to make it \Lambda-poised (without changing too many points). However, linear models are not typically used in practice, except for special cases (such as nonlinear least-squares minimization [8]), with quadratic models being preferred [28, 29, 6].

We additionally note that the BOBYQA code [29] for bound-constrained optimization uses maximization of Lagrange polynomials inside the feasible region—similar to how we will define \Lambda-poisedness in this setting—but this was a heuristic with no theoretical analysis. (In fact, our results here give some theoretical justification to this approach.)

1.2 Contributions

The contributions of this work are:

  • A generalization of the linear interpolation theory (based on sampling feasible points) from [22] to the case of linear regression models. This includes fully linear error bounds and procedures to generate \Lambda-poised interpolation sets (which are simpler than those developed in [22]).

  • A full approximation theory for underdetermined quadratic interpolation (in the style of [26, 27]) based on sampling only feasible points. This includes derivation of fully linear error bounds, defining the relevant notion of \Lambda-poisedness in this setting, and procedures for efficiently generating \Lambda-poised sets. By construction, in the case of a full interpolation set of (n+1)(n+2)/2 points for quadratic interpolation, this approach produces the same result as standard quadratic interpolation, and so our results also apply to this setting.

While the linear regression theory is of independent interest for DFO applied to noisy objectives [24], it is primarily included as a prerequisite for deriving the quadratic results. The results proven in both cases are sufficient to enable these model constructions to be used in the algorithmic framework from [22] (with the associated convergence and worst-case complexity guarantees).

While the formulations and results are very similar to those for the unconstrained case from [14], the constrained setting necessitates new approaches for proving the results (which, when applied to the C=\mathbb{R}^{n} case, provide simpler proofs of existing results).

Organization

A summary of the modified fully linear model accuracy requirement and associated algorithm from [22] is given in Section 2. That \Lambda-poisedness (only over the feasible region) guarantees fully linear interpolation models is first shown for linear regression models in Section 3 and then for (underdetermined) quadratic interpolation models in Section 4. Lastly, Section 5 provides concrete procedures for identifying and constructing \Lambda-poised quadratic interpolation models using only feasible points.

Notation

We let \|\cdot\| denote the Euclidean norm of vectors and the operator 2-norm of matrices, and use B(\bm{x},\Delta), for \bm{x}\in\mathbb{R}^{n} and \Delta>0, to denote the closed ball \{\bm{y}\in\mathbb{R}^{n}:\|\bm{y}-\bm{x}\|\leq\Delta\}.

2 Background

This section summarizes the key background and results from [22], which provide a model-based DFO algorithm for solving (1.1).

We now outline how problem (1.1) can be solved using model-based DFO methods, beginning with our problem assumptions.

Assumption 2.1.

The objective function f:\mathbb{R}^{n}\to\mathbb{R} is bounded below by f_{\textnormal{low}} and continuously differentiable, and \nabla f is L_{\nabla f}-Lipschitz continuous.

Although \nabla f exists, we assume that it is not available to the algorithm as an oracle.

Assumption 2.2.

The feasible set C\subseteq\mathbb{R}^{n} is closed and convex with nonempty interior.

Assumption 2.2 implies that the Euclidean projection operator for C,

\operatorname{proj}_{C}(\bm{y}):=\operatorname*{arg\,min}_{\bm{x}\in C}\|\bm{y}-\bm{x}\|^{2}, (2.1)

is a well-defined function \operatorname{proj}_{C}:\mathbb{R}^{n}\to C [3, Theorem 6.25]. Our framework is based on a setting in which \operatorname{proj}_{C} is inexpensive to evaluate (at least compared to evaluations of the objective f). Several examples of feasible sets C which have simple analytic expressions for \operatorname{proj}_{C} are given in [3, Table 6.1].
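As a concrete illustration, the following minimal sketch (assuming NumPy and a simple box constraint C=\{\bm{x}:\bm{l}\leq\bm{x}\leq\bm{u}\}, one of the cases covered by [3, Table 6.1]; the function name is ours, not from [22]) shows how such a projection can be evaluated:

```python
import numpy as np

def proj_box(y, lower, upper):
    """Euclidean projection of y onto the box C = {x : lower <= x <= upper}.

    For a box, the projection (2.1) decouples by coordinate and reduces to
    clipping each entry of y to the interval [lower_i, upper_i].
    """
    return np.clip(y, lower, upper)

# Example usage: project an infeasible point onto [0, 1]^2.
lower, upper = np.zeros(2), np.ones(2)
print(proj_box(np.array([1.5, -0.2]), lower, upper))  # [1.0, 0.0]
```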

In this work, as in [22], we aim to find a first-order optimal solution to (1.1). To measure this, we use the following first-order criticality measure from [12, Section 12.1.4], defined for all \bm{x}\in C:

\pi^{f}(\bm{x}):=\left|\min_{\bm{x}+\bm{d}\in C,\ \|\bm{d}\|\leq 1}\nabla f(\bm{x})^{T}\bm{d}\right|. (2.2)

From [12, Theorem 12.1.6], we have that \pi^{f} is a non-negative, continuous function of \bm{x} and \pi^{f}(\bm{x}^{*})=0 if and only if \bm{x}^{*} is a first-order critical point for (1.1). In the unconstrained case C=\mathbb{R}^{n}, (2.2) simplifies to \pi^{f}(\bm{x})=\|\nabla f(\bm{x})\|. For notational convenience, we define \pi^{f}_{k}:=\pi^{f}(\bm{x}_{k}), where \bm{x}_{k}\in\mathbb{R}^{n} is the k-th iterate of our algorithm.
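For intuition, (2.2) (and likewise the model criticality measure (2.5) below, with \bm{g}_{k} in place of \nabla f(\bm{x})) can be evaluated by solving a small convex subproblem. The following sketch is illustrative only: SciPy's SLSQP solver and the box feasible set are our assumptions, not part of [22]:

```python
import numpy as np
from scipy.optimize import minimize

def criticality_measure(grad, x, lower, upper):
    """Approximate pi^f(x) = |min_{x+d in C, ||d||<=1} grad^T d| for a box C.

    The feasible set for d is the shifted box [lower-x, upper-x] intersected
    with the Euclidean unit ball.
    """
    n = len(x)
    bounds = list(zip(lower - x, upper - x))                           # x + d in C
    ball = {"type": "ineq", "fun": lambda d: 1.0 - np.linalg.norm(d)}  # ||d|| <= 1
    res = minimize(lambda d: grad @ d, np.zeros(n), method="SLSQP",
                   bounds=bounds, constraints=[ball])
    return abs(res.fun)

# Example: at an interior point far from the bounds this reduces to ||grad||.
x = np.array([1.0, 1.0])
grad = np.array([0.3, -0.4])
print(criticality_measure(grad, x, np.zeros(2), 2 * np.ones(2)))  # approximately 0.5
```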

2.1 Model Construction & Accuracy

We will consider a model-based DFO approach for solving (1.1). Specifically, at each iteration k we construct a local quadratic approximation to f which we hope to be accurate near the current iterate \bm{x}_{k}\in\mathbb{R}^{n}:

f(\bm{y})\approx m_{k}(\bm{y}):=c_{k}+\bm{g}_{k}^{T}(\bm{y}-\bm{x}_{k})+\frac{1}{2}(\bm{y}-\bm{x}_{k})^{T}H_{k}(\bm{y}-\bm{x}_{k}), (2.3)

for some c_{k}\in\mathbb{R}, \bm{g}_{k}\in\mathbb{R}^{n} and H_{k}\in\mathbb{R}^{n\times n} with H_{k}=H_{k}^{T}.

Assumption 2.3.

There exists \kappa_{H}\geq 1 such that \|H_{k}\|\leq\kappa_{H}-1 for all k.

Convergence of our algorithm will require that the local models m_{k} (2.3) are sufficiently accurate approximations of the objective f (at least on some iterations). The required notion of ‘sufficiently accurate’ is the following:

Definition 2.4.

The model m_{k} (2.3) is C-fully linear in B(\bm{x}_{k},\Delta_{k}) if there exist \kappa_{\textnormal{ef}},\kappa_{\textnormal{eg}}>0, independent of k, such that

\max_{\bm{x}_{k}+\bm{d}\in C,\ \|\bm{d}\|\leq\Delta_{k}}|f(\bm{x}_{k}+\bm{d})-m_{k}(\bm{x}_{k}+\bm{d})| \leq\kappa_{\textnormal{ef}}\Delta_{k}^{2}, (2.4a)
\max_{\bm{x}_{k}+\bm{d}\in C,\ \|\bm{d}\|\leq 1}\left|(\nabla f(\bm{x}_{k})-\bm{g}_{k})^{T}\bm{d}\right| \leq\kappa_{\textnormal{eg}}\Delta_{k}. (2.4b)

When we wish to refer to this notion in the abstract sense, we use the term “C-full linearity”.

We note in particular that (2.4a) uses the constraint \|\bm{d}\|\leq\Delta_{k} and (2.4b) uses \|\bm{d}\|\leq 1. This essentially corresponds to the standard definition of fully linear models [15, Definition 3.1] in the unconstrained case C=\mathbb{R}^{n}.

Our algorithm will require the existence of two procedures related to C-full linearity: given a model m_{k}, iterate \bm{x}_{k} and trust-region radius \Delta_{k}, we assume that we can

  • Verify whether or not m_{k} is C-fully linear in B(\bm{x}_{k},\Delta_{k}); and

  • If m_{k} is not C-fully linear, create a new model (typically a small modification of m_{k}) which is C-fully linear in B(\bm{x}_{k},\Delta_{k}).

Lastly, our model m_{k} (2.3) induces an approximate first-order criticality measure, namely

\pi^{m}(\bm{x}):=\left|\min_{\bm{x}+\bm{d}\in C,\ \|\bm{d}\|\leq 1}\bm{g}_{k}^{T}\bm{d}\right|. (2.5)

As above, for convenience we will define \pi^{m}_{k}:=\pi^{m}(\bm{x}_{k}).

2.2 Algorithm Specification

Our DFO method is based on a trust-region framework. That is, given the local model m_{k} (2.3), we calculate a potential new iterate as \bm{x}_{k}+\bm{s}_{k}, where \bm{s}_{k} is an approximate minimizer of the trust-region subproblem

\min_{\bm{x}_{k}+\bm{s}\in C,\ \|\bm{s}\|\leq\Delta_{k}}m_{k}(\bm{x}_{k}+\bm{s}), (2.6)

where \Delta_{k}>0 is a parameter updated dynamically inside the algorithm. Formally, we require the following, which can be achieved using a Goldstein-type linesearch method [12, Algorithm 12.2.2 & Theorem 12.2.2].

Assumption 2.5.

There exists a constant c_{1}\in(0,1) such that the computed step \bm{s}_{k} satisfies \bm{x}_{k}+\bm{s}_{k}\in C, \|\bm{s}_{k}\|\leq\Delta_{k} and the generalized Cauchy decrease condition:

m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})\geq c_{1}\pi^{m}_{k}\min\left(\frac{\pi^{m}_{k}}{1+\|H_{k}\|},\Delta_{k},1\right). (2.7)
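Whichever subproblem solver is used, (2.7) is straightforward to verify a posteriori for a candidate step. A minimal sketch, assuming NumPy (the function name and the value of c_{1} are illustrative placeholders, not from [22]):

```python
import numpy as np

def cauchy_decrease_ok(c, g, H, s, pi_m, Delta, c1=0.1):
    """Check the generalized Cauchy decrease condition (2.7) for a step s."""
    model = lambda d: c + g @ d + 0.5 * d @ H @ d       # m_k(x_k + d), cf. (2.3)
    decrease = model(np.zeros_like(s)) - model(s)       # m_k(x_k) - m_k(x_k + s_k)
    required = c1 * pi_m * min(pi_m / (1.0 + np.linalg.norm(H, 2)), Delta, 1.0)
    return decrease >= required
```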

The full derivative-free algorithm for solving (1.1) from [22], called CDFO-TR, is given in Algorithm 1. The overall structure is similar to other model-based DFO methods, such as [15, Algorithm 4.1] for unconstrained minimization.

1: Starting point \bm{x}_{0}\in\mathbb{R}^{n} and trust-region radius \Delta_{0}>0.
2: Parameters: maximum trust-region radius \Delta_{\max}\geq\Delta_{0}, scaling factors 0<\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}, criticality constants \epsilon_{C},\mu>0, and acceptance threshold \eta\in(0,1).
3: Build an initial model m_{0} (2.3).
4: for k=0,1,2,\ldots do
5:     if \pi^{m}_{k}<\epsilon_{C} and (\pi^{m}_{k}<\mu^{-1}\Delta_{k} or m_{k} is not C-fully linear in B(\bm{x}_{k},\Delta_{k})) then
6:         Criticality step: Set \bm{x}_{k+1}=\bm{x}_{k}. If m_{k} is C-fully linear in B(\bm{x}_{k},\Delta_{k}), set \Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}, otherwise set \Delta_{k+1}=\Delta_{k}. Construct m_{k+1} to be C-fully linear in B(\bm{x}_{k+1},\Delta_{k+1}).
7:     else ▷ \pi^{m}_{k}\geq\epsilon_{C} or (\pi^{m}_{k}\geq\mu^{-1}\Delta_{k} and m_{k} is C-fully linear in B(\bm{x}_{k},\Delta_{k}))
8:         Approximately solve (2.6) to get a step \bm{s}_{k}.
9:         Evaluate f(\bm{x}_{k}+\bm{s}_{k}) and calculate the ratio
\rho_{k}:=\frac{f(\bm{x}_{k})-f(\bm{x}_{k}+\bm{s}_{k})}{m_{k}(\bm{x}_{k})-m_{k}(\bm{x}_{k}+\bm{s}_{k})}. (2.8)
10:         if \rho_{k}\geq\eta then
11:              Successful step: Set \bm{x}_{k+1}=\bm{x}_{k}+\bm{s}_{k} and \Delta_{k+1}=\min(\gamma_{\textnormal{inc}}\Delta_{k},\Delta_{\max}). Form m_{k+1} in any manner.
12:         else if m_{k} is not C-fully linear in B(\bm{x}_{k},\Delta_{k}) then
13:              Model-improving step: Set \bm{x}_{k+1}=\bm{x}_{k} and \Delta_{k+1}=\Delta_{k}, and construct m_{k+1} to be C-fully linear in B(\bm{x}_{k+1},\Delta_{k+1}).
14:         else ▷ \rho_{k}<\eta and m_{k} is C-fully linear in B(\bm{x}_{k},\Delta_{k})
15:              Unsuccessful step: Set \bm{x}_{k+1}=\bm{x}_{k} and \Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}. Form m_{k+1} in any manner.
16:         end if
17:     end if
18: end for
Algorithm 1 CDFO-TR: model-based DFO method for (1.1), from [22].

We have the following global convergence and worst-case complexity results for Algorithm 1 from [22].

Theorem 2.6 (Theorem 3.10 & Corollary 3.15, [22]).

If Assumptions 2.1, 2.3 and 2.5 hold, then \lim_{k\to\infty}\pi^{f}_{k}=0. Moreover, if \epsilon\in(0,1] and \epsilon_{C}\geq c_{2}\epsilon for some constant c_{2}>0, then the number of iterations k before Algorithm 1 produces an iterate with \pi^{f}_{k}<\epsilon is at most \mathcal{O}(\kappa_{H}\kappa_{d}^{2}\epsilon^{-2}), where \kappa_{d}:=\max(\kappa_{\textnormal{ef}},\kappa_{\textnormal{eg}}).

In [22], details are given on how to construct fully linear models based on linear interpolation to feasible points. Although linear models can be practical for some structured problems, such as nonlinear least-squares objectives (e.g. [8]), quadratic models are generally preferred. The remainder of this paper is devoted to the construction of fully linear quadratic models by only sampling the objective at feasible points.

3 Linear Regression Models

We first consider how to extend the linear interpolation approximation theory from [22] to the case of linear regression models (with the slight extra generalization that the base point \bm{x} does not need to be an interpolation point). The purpose of this is twofold: regression models can be useful for modelling noisy objectives (e.g. [5]), and this theory will be necessary to develop the corresponding quadratic interpolation theory in Section 4. In the unconstrained case C=\mathbb{R}^{n}, these results were originally proven in [14].

Suppose we have sampled the function f at p points \{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset\mathbb{R}^{n} (where p\geq n+1), and given this information we wish to find a linear model

f(\bm{y})\approx m(\bm{y}):=c+\bm{g}^{T}(\bm{y}-\bm{x}), (3.1)

by solving

\min_{c,\bm{g}\in\mathbb{R}\times\mathbb{R}^{n}}\ \sum_{t=1}^{p}(f(\bm{y}_{t})-m(\bm{y}_{t}))^{2}. (3.2)

Equivalently, we can find the c and \bm{g} for our model by finding the least-squares solution to the p\times(n+1) system

M\begin{bmatrix}c\\ \bm{g}\end{bmatrix}:=\begin{bmatrix}1&(\bm{y}_{1}-\bm{x})^{T}\\ \vdots&\vdots\\ 1&(\bm{y}_{p}-\bm{x})^{T}\end{bmatrix}\begin{bmatrix}c\\ \bm{g}\end{bmatrix}=\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\end{bmatrix}, (3.3)

which may be written as

\begin{bmatrix}c\\ \bm{g}\end{bmatrix}=M^{\dagger}\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\end{bmatrix}, (3.4)

where M^{\dagger}\in\mathbb{R}^{(n+1)\times p} is the Moore-Penrose pseudoinverse of M.
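A minimal numerical sketch of (3.3)–(3.4), assuming NumPy (the sample points, base point and objective below are illustrative placeholders):

```python
import numpy as np

def linear_regression_model(f, x, Y):
    """Build the linear model (3.1) by solving the least-squares system (3.3).

    Y is a (p, n) array whose rows are the sample points y_1, ..., y_p.
    Returns the model constant c and gradient g.
    """
    p, n = Y.shape
    M = np.hstack([np.ones((p, 1)), Y - x])              # p x (n+1) matrix from (3.3)
    fvals = np.array([f(y) for y in Y])
    coeffs, *_ = np.linalg.lstsq(M, fvals, rcond=None)   # minimum-norm least squares
    return coeffs[0], coeffs[1:]                         # c, g

# Example usage on a toy smooth objective.
f = lambda y: np.sin(y[0]) + y[1] ** 2
x = np.array([0.2, 0.3])
Y = x + 0.1 * np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.5, 0.5]])
c, g = linear_regression_model(f, x, Y)
print(c, g)   # g should be close to grad f(x) = [cos(0.2), 0.6]
```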

Later we will require the following standard properties of M^{\dagger} (see, e.g. [18, Section 5.5.4]).

Lemma 3.1.

For p\geq n+1, the Moore-Penrose pseudoinverse M^{\dagger} of M (3.3) satisfies (M^{T})^{\dagger}=(M^{\dagger})^{T} and M^{\dagger}MM^{\dagger}=M^{\dagger}. The minimal-norm solution to the underdetermined system M^{T}\bm{u}=\bm{v} is \bm{u}=(M^{T})^{\dagger}\bm{v}.

The quality of the choice of interpolation points will be assessed by considering the associated set of Lagrange polynomials. In this case, the Lagrange polynomials associated with our interpolation set are the linear functions

\ell_{t}(\bm{y}):=c_{t}+\bm{g}_{t}^{T}(\bm{y}-\bm{x}),\qquad\forall t=1,\ldots,p, (3.5)

each defined by the least-squares regression problem

\min_{c_{t},\bm{g}_{t}\in\mathbb{R}\times\mathbb{R}^{n}}\ \sum_{s=1}^{p}(\delta_{s,t}-\ell_{t}(\bm{y}_{s}))^{2},\qquad\forall t=1,\ldots,p, (3.6)

or equivalently

\begin{bmatrix}c_{t}\\ \bm{g}_{t}\end{bmatrix}=M^{\dagger}\bm{e}_{t},\qquad\forall t=1,\ldots,p. (3.7)

Our notion of the sampled points \{\bm{y}_{1},\ldots,\bm{y}_{p}\} being a ‘good’ choice is given by the Lagrange polynomials having small magnitude in the region of interest. This is formalized in the following notion, which generalizes [14, Definition 2.7] to the convex feasible region C.

Definition 3.2.

Given \Lambda\geq 1, the set \{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset C is \Lambda-poised for linear regression in B(\bm{x},\Delta)\cap C if \{\bm{y}_{2}-\bm{y}_{1},\ldots,\bm{y}_{p}-\bm{y}_{1}\} spans \mathbb{R}^{n} and

\max_{t=1,\ldots,p}|\ell_{t}(\bm{y})|\leq\Lambda,\qquad\forall\bm{y}\in C\cap B(\bm{x},\min(\Delta,1)). (3.8)

We note that if \{\bm{y}_{1},\ldots,\bm{y}_{p}\} is \Lambda-poised for linear regression, then M has full column rank (since \{\bm{y}_{2}-\bm{y}_{1},\ldots,\bm{y}_{p}-\bm{y}_{1}\} is assumed to span \mathbb{R}^{n}). This requirement also necessitates that p\geq n+1.
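The constant in Definition 3.2 can be estimated numerically by evaluating the Lagrange polynomials (3.7) at sampled feasible points. A rough sketch, assuming NumPy and a box feasible set (the sampling is only a heuristic lower bound on the exact maximum in (3.8)); note that since \bm{x}\in C and projection onto a convex set is nonexpansive, projecting ball samples onto C keeps them inside the ball:

```python
import numpy as np

def estimate_poisedness(Y, x, Delta, lower, upper, n_samples=5000, seed=0):
    """Estimate max_t max_y |l_t(y)| over y in C ∩ B(x, min(Delta,1)) for linear
    regression, where C is the box {lower <= y <= upper}, by random sampling."""
    rng = np.random.default_rng(seed)
    p, n = Y.shape
    M = np.hstack([np.ones((p, 1)), Y - x])              # matrix from (3.3)
    Mpinv = np.linalg.pinv(M)                            # (n+1) x p pseudoinverse
    r = min(Delta, 1.0)
    best = 0.0
    for _ in range(n_samples):
        d = rng.normal(size=n)
        d *= r * rng.uniform() ** (1.0 / n) / np.linalg.norm(d)   # uniform in the ball
        y = np.clip(x + d, lower, upper)                 # project onto C (stays in ball)
        ell = Mpinv.T @ np.concatenate(([1.0], y - x))   # all Lagrange values, cf. (3.7)
        best = max(best, np.max(np.abs(ell)))
    return best

# Example usage with five feasible points in [0, 1]^2 around x.
x = np.array([0.5, 0.5])
Y = x + 0.2 * np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [0.5, 0.5]])
print(estimate_poisedness(Y, x, Delta=0.5, lower=np.zeros(2), upper=np.ones(2)))
```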

Assuming \Lambda-poisedness of \{\bm{y}_{1},\ldots,\bm{y}_{p}\} will be sufficient for the associated linear regression model to be C-fully linear. The proof of this follows a similar structure to [22, Lemma 4.3 & Theorem 4.4], but with additional complexity in the proofs coming from the use of regression instead of interpolation.

Lemma 3.3.

Suppose f satisfies Assumption 2.1 and C satisfies Assumption 2.2. Then if \{\bm{y}_{1},\ldots,\bm{y}_{p}\} is \Lambda-poised for linear regression in B(\bm{x},\Delta)\cap C and \|\bm{y}_{t}-\bm{x}\|\leq\beta\min(\Delta,1) for all t=1,\ldots,p and some \beta>0, we have

|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|\leq\frac{p\Lambda L_{\nabla f}\beta^{2}}{2}\min(\Delta,1)^{2}, (3.9)

for all \bm{y}\in B(\bm{x},\min(\Delta,1))\cap C. If we also have \bm{x}\in C, this in turn implies error bounds on c and \bm{g} individually, namely

|c-f(\bm{x})|\leq\frac{p\Lambda L_{\nabla f}\beta^{2}}{2}\min(\Delta,1)^{2}, (3.10)

and

|(\bm{y}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))|\leq p\Lambda L_{\nabla f}\beta^{2}\min(\Delta,1)^{2}, (3.11)

for all \bm{y}\in B(\bm{x},\min(\Delta,1))\cap C.

Proof.

We begin by defining the residual of the least-squares problem (3.3) as

\bm{r}:=M\begin{bmatrix}c\\ \bm{g}\end{bmatrix}-\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\end{bmatrix}=(MM^{\dagger}-I)\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\end{bmatrix}\in\mathbb{R}^{p}. (3.12)

Then for all t=1,\ldots,p we have

m(\bm{y}_{t})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{t}-\bm{x}) =\begin{bmatrix}c\\ \bm{g}\end{bmatrix}^{T}\underbrace{\begin{bmatrix}1\\ \bm{y}_{t}-\bm{x}\end{bmatrix}}_{=M^{T}\bm{e}_{t}}-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{t}-\bm{x}), (3.13)
=\bm{r}^{T}\bm{e}_{t}+f(\bm{y}_{t})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{t}-\bm{x}). (3.14)

Now, fix \bm{y}\in B(\bm{x},\min(\Delta,1))\cap C. Since M has full column rank, its rows span \mathbb{R}^{n+1} and so there exist constants \alpha_{t}(\bm{y}) such that

\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}=\sum_{t=1}^{p}\alpha_{t}(\bm{y})\begin{bmatrix}1\\ \bm{y}_{t}-\bm{x}\end{bmatrix}=M^{T}\bm{\alpha}(\bm{y}). (3.15)

Of these, we take the \bm{\alpha}(\bm{y}) with minimal norm (c.f. Lemma 3.1),

\bm{\alpha}(\bm{y})=(M^{T})^{\dagger}\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}. (3.16)

There are two key properties of \bm{\alpha}(\bm{y}): firstly,

\ell_{t}(\bm{y})=\begin{bmatrix}c_{t}\\ \bm{g}_{t}\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}=\bm{e}_{t}^{T}(M^{\dagger})^{T}\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}=\alpha_{t}(\bm{y}), (3.17)

where the last equality follows from (M^{T})^{\dagger}=(M^{\dagger})^{T} (Lemma 3.1). Hence |\alpha_{t}(\bm{y})|=|\ell_{t}(\bm{y})|\leq\Lambda from the \Lambda-poisedness condition. Secondly, we have

\bm{\alpha}(\bm{y})^{T}\bm{r}=\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}^{T}M^{\dagger}(MM^{\dagger}-I)\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\end{bmatrix}=0, (3.18)

from M^{\dagger}MM^{\dagger}=M^{\dagger} (Lemma 3.1). Altogether, we have

|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})| =\left|\begin{bmatrix}c\\ \bm{g}\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}-\begin{bmatrix}f(\bm{x})\\ \nabla f(\bm{x})\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}\right|, (3.19)
=\left|\sum_{t=1}^{p}\alpha_{t}(\bm{y})\left(\begin{bmatrix}c\\ \bm{g}\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}_{t}-\bm{x}\end{bmatrix}-\begin{bmatrix}f(\bm{x})\\ \nabla f(\bm{x})\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}_{t}-\bm{x}\end{bmatrix}\right)\right|, (3.20)
=\left|\sum_{t=1}^{p}\alpha_{t}(\bm{y})\left\{\bm{r}^{T}\bm{e}_{t}+f(\bm{y}_{t})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{t}-\bm{x})\right\}\right|, (3.21)
\leq\left|\bm{\alpha}(\bm{y})^{T}\bm{r}\right|+\frac{L_{\nabla f}}{2}\sum_{t=1}^{p}|\alpha_{t}(\bm{y})|\cdot\|\bm{y}_{t}-\bm{x}\|^{2}, (3.22)
\leq\frac{p\Lambda L_{\nabla f}\beta^{2}}{2}\min(\Delta,1)^{2}, (3.23)

and we recover (3.9), where we used (3.18), |\alpha_{t}(\bm{y})|\leq\Lambda and \|\bm{y}_{t}-\bm{x}\|\leq\beta\min(\Delta,1) to get the last inequality.

If \bm{x}\in C, we can take \bm{y}=\bm{x} in (3.9) to get (3.10). Combining (3.10) with (3.9) we get

|(\bm{y}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))| \leq|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|+|c-f(\bm{x})|, (3.24)
\leq p\Lambda L_{\nabla f}\beta^{2}\min(\Delta,1)^{2}, (3.25)

and we get (3.11). ∎

Theorem 3.4.

Suppose the assumptions of Lemma 3.3 hold and \bm{x}\in C. Then the regression model (3.1) is C-fully linear in B(\bm{x},\Delta) with constants

\kappa_{\textnormal{ef}}=p\Lambda L_{\nabla f}\beta^{2}+\frac{L_{\nabla f}}{2},\qquad\text{and}\qquad\kappa_{\textnormal{eg}}=p\Lambda L_{\nabla f}\beta^{2}, (3.26)

in (2.4).

Proof.

Fix \bm{y}\in B(\bm{x},\Delta)\cap C. We first derive \kappa_{\textnormal{ef}} by considering the cases \Delta>1 and \Delta\leq 1 separately.

If \Delta>1, then \hat{\bm{y}}=\bm{x}+\frac{1}{\Delta}(\bm{y}-\bm{x})\in B(\bm{x},1)=B(\bm{x},\min(\Delta,1)). Since C is convex and \bm{x},\bm{y}\in C we also have \hat{\bm{y}}\in C. Hence (3.9) gives

|c+𝒈T(𝒚^𝒙)f(𝒙)f(𝒙)T(𝒚^𝒙)|\displaystyle|c+\bm{g}^{T}(\hat{\bm{y}}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\hat{\bm{y}}-\bm{x})| pΛLfβ22min(Δ,1)2,\displaystyle\leq\frac{p\Lambda L_{\nabla f}\beta^{2}}{2}\min(\Delta,1)^{2}, (3.27)

and so

|c+1Δ𝒈T(𝒚𝒙)f(𝒙)1Δf(𝒙)T(𝒚𝒙)|\displaystyle\left|c+\frac{1}{\Delta}\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\frac{1}{\Delta}\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})\right| pΛLfβ22min(Δ,1)2.\displaystyle\leq\frac{p\Lambda L_{\nabla f}\beta^{2}}{2}\min(\Delta,1)^{2}. (3.28)

This gives us

|f(𝒚)m(𝒚)|\displaystyle|f(\bm{y})-m(\bm{y})| |f(𝒚)f(𝒙)f(𝒙)T(𝒚𝒙)|+|m(𝒚)f(𝒙)f(𝒙)T(𝒚𝒙)|,\displaystyle\leq|f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|+|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|, (3.29)
Lf2𝒚𝒙2+Δ|1Δc+1Δ𝒈T(𝒚𝒙)1Δf(𝒙)1Δf(𝒙)T(𝒚𝒙)|,\displaystyle\leq\frac{L_{\nabla f}}{2}\|\bm{y}-\bm{x}\|^{2}+\Delta\left|\frac{1}{\Delta}c+\frac{1}{\Delta}\bm{g}^{T}(\bm{y}-\bm{x})-\frac{1}{\Delta}f(\bm{x})-\frac{1}{\Delta}\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})\right|, (3.30)
Lf2𝒚𝒙2+Δ[|c+1Δ𝒈T(𝒚𝒙)f(𝒙)1Δf(𝒙)T(𝒚𝒙)|\displaystyle\leq\frac{L_{\nabla f}}{2}\|\bm{y}-\bm{x}\|^{2}+\Delta\left[\left|c+\frac{1}{\Delta}\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\frac{1}{\Delta}\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})\right|\right.
+|(1Δ1)(cf(𝒙))|],\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\left.+\left|\left(\frac{1}{\Delta}-1\right)(c-f(\bm{x}))\right|\right], (3.31)
Lf2Δ2+Δ[pΛLfβ22min(Δ,1)2\displaystyle\leq\frac{L_{\nabla f}}{2}\Delta^{2}+\Delta\left[\frac{p\Lambda L_{\nabla f}\beta^{2}}{2}\min(\Delta,1)^{2}\right.
+(11Δ)pΛLfβ22min(Δ,1)2],\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\left.+\left(1-\frac{1}{\Delta}\right)\frac{p\Lambda L_{\nabla f}\beta^{2}}{2}\min(\Delta,1)^{2}\right], (3.32)
Lf2Δ2+pΛLfβ2Δmin(Δ,1)2,\displaystyle\leq\frac{L_{\nabla f}}{2}\Delta^{2}+p\Lambda L_{\nabla f}\beta^{2}\Delta\min(\Delta,1)^{2}, (3.33)
=Lf2Δ2+pΛLfβ2Δ2,\displaystyle=\frac{L_{\nabla f}}{2}\Delta^{2}+p\Lambda L_{\nabla f}\beta^{2}\Delta^{2}, (3.34)

where we use (3.10) to get the fourth inequality, and the last line follows from \Delta>1. If instead \Delta\leq 1, then \bm{y}\in B(\bm{x},\min(\Delta,1))\cap C already, and so (3.9) immediately gives

|f(\bm{y})-m(\bm{y})| \leq|f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|+\left|m(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})\right|, (3.35)
\leq\frac{L_{\nabla f}}{2}\Delta^{2}+\frac{p\Lambda L_{\nabla f}\beta^{2}}{2}\min(\Delta,1)^{2}, (3.36)
\leq\frac{L_{\nabla f}}{2}\Delta^{2}+\frac{p\Lambda L_{\nabla f}\beta^{2}}{2}\Delta^{2}. (3.37)

Either way, we get the desired value of \kappa_{\textnormal{ef}}.

To get \kappa_{\textnormal{eg}} we now fix an arbitrary \widetilde{\bm{y}}\in B(\bm{x},1)\cap C and again consider the cases \Delta\geq 1 and \Delta<1 separately. First, if \Delta\geq 1, then \widetilde{\bm{y}}\in B(\bm{x},\min(\Delta,1))\cap C. From (3.11) we get

|(𝒚~𝒙)T(𝒈f(𝒙))|pΛLfβ2min(Δ,1)2pΛLfβ2Δ,\displaystyle|(\widetilde{\bm{y}}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))|\leq p\Lambda L_{\nabla f}\beta^{2}\min(\Delta,1)^{2}\leq p\Lambda L_{\nabla f}\beta^{2}\Delta, (3.38)

where the second inequality follows from \Delta\geq 1. Alternatively, if \Delta<1 then the convexity of C implies that \hat{\bm{y}}:=\bm{x}+\Delta(\widetilde{\bm{y}}-\bm{x})\in B(\bm{x},\Delta)\cap C=B(\bm{x},\min(\Delta,1))\cap C. Again from (3.11) we get

|(𝒚~𝒙)T(𝒈f(𝒙))|\displaystyle|(\widetilde{\bm{y}}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))| =1Δ|(𝒚^𝒙)T(𝒈f(𝒙))|,\displaystyle=\frac{1}{\Delta}|(\hat{\bm{y}}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))|, (3.39)
pΛLfβ2Δ1min(Δ,1)2,\displaystyle\leq p\Lambda L_{\nabla f}\beta^{2}\Delta^{-1}\min(\Delta,1)^{2}, (3.40)
=pΛLfβ2Δ,\displaystyle=p\Lambda L_{\nabla f}\beta^{2}\Delta, (3.41)

where the last line follows from \Delta<1. Again, either way we get the desired value of \kappa_{\textnormal{eg}}. ∎
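As an illustrative numerical sanity check of Theorem 3.4 (not part of the theory above), the following sketch, assuming NumPy, a box feasible set and entirely placeholder choices of objective and points, estimates \Lambda by feasible sampling and compares the sampled model error against the bound (2.4a) with \kappa_{\textnormal{ef}} from (3.26):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, Delta = 2, 6, 0.3
lower, upper = np.zeros(n), np.ones(n)
x = np.array([0.5, 0.5])

# Toy smooth objective whose gradient Lipschitz constant is known exactly.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
f = lambda y: 0.5 * y @ A @ y
L_grad = np.linalg.norm(A, 2)

# Feasible sample points in B(x, min(Delta,1)) ∩ C (projection keeps them in the ball).
r = min(Delta, 1.0)
Y = np.array([np.clip(x + r * d / np.linalg.norm(d), lower, upper)
              for d in rng.normal(size=(p, n))])

# Regression model (3.1)-(3.4) and its pseudoinverse for the Lagrange polynomials (3.7).
M = np.hstack([np.ones((p, 1)), Y - x])
coeffs = np.linalg.lstsq(M, np.array([f(y) for y in Y]), rcond=None)[0]
c, g = coeffs[0], coeffs[1:]
Mpinv = np.linalg.pinv(M)

# Estimate Lambda in (3.8) and the left-hand side of (2.4a) by feasible sampling.
def sample_feasible():
    d = rng.normal(size=n)
    d *= r * rng.uniform() ** (1.0 / n) / np.linalg.norm(d)
    return np.clip(x + d, lower, upper)

samples = [sample_feasible() for _ in range(5000)]
Lam = max(np.max(np.abs(Mpinv.T @ np.concatenate(([1.0], y - x)))) for y in samples)
err = max(abs(f(y) - (c + g @ (y - x))) for y in samples)

beta = max(np.linalg.norm(y - x) for y in Y) / r
kappa_ef = p * Lam * L_grad * beta ** 2 + L_grad / 2
print(err, "<=", kappa_ef * Delta ** 2)   # the bound (2.4a) from Theorem 3.4 should hold
```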

4 Underdetermined Quadratic Interpolation Models

We now consider the case of forming C-fully linear quadratic interpolation models. Our approach follows that of [26] for the unconstrained case, although the fully linear error bounds for this approach were shown later in [16]. We note that this approach is different to the underdetermined quadratic interpolation used in [14].

Here, we aim to construct a quadratic model

f(\bm{y})\approx m(\bm{y}):=c+\bm{g}^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}H(\bm{y}-\bm{x}), (4.1)

where c\in\mathbb{R}, \bm{g}\in\mathbb{R}^{n} and H\in\mathbb{R}^{n\times n} with H=H^{T}. We assume that our interpolation set is \{\bm{y}_{1},\ldots,\bm{y}_{p}\} with p\in\{n+2,\ldots,(n+1)(n+2)/2\}. The case p=(n+1)(n+2)/2 corresponds to full quadratic interpolation [13, Section 4.2]. We exclude the case p=n+1 which, from (4.2) below, corresponds to linear interpolation and was analyzed (in the convex-constrained case) in [22, Section 4].

If n+2\leq p<(n+1)(n+2)/2 then there are infinitely many models satisfying the interpolation conditions f(\bm{y}_{t})=m(\bm{y}_{t}) for all t=1,\ldots,p. So, following [26] we choose the model with minimum Frobenius norm Hessian by solving

\min_{c,\bm{g},H\in\mathbb{R}\times\mathbb{R}^{n}\times\mathbb{R}^{n\times n}}\ \frac{1}{4}\|H\|_{F}^{2}, (4.2a)
\text{s.t. } f(\bm{y}_{t})=m(\bm{y}_{t}),\qquad\forall t=1,\ldots,p. (4.2b)

This is a convex quadratic program and, as shown in [26], reduces to solving the (p+n+1)\times(p+n+1) linear system

F\begin{bmatrix}\lambda_{1}\\ \vdots\\ \lambda_{p}\\ c\\ \bm{g}\end{bmatrix}:=\begin{bmatrix}Q&M\\ M^{T}&0\end{bmatrix}\begin{bmatrix}\lambda_{1}\\ \vdots\\ \lambda_{p}\\ c\\ \bm{g}\end{bmatrix}=\begin{bmatrix}f(\bm{y}_{1})\\ \vdots\\ f(\bm{y}_{p})\\ 0\\ \bm{0}\end{bmatrix}\in\mathbb{R}^{p+n+1}, (4.20)

where M\in\mathbb{R}^{p\times(n+1)} is the matrix from the linear regression problem (3.3) and Q\in\mathbb{R}^{p\times p} has entries Q_{i,j}=\frac{1}{2}[(\bm{y}_{i}-\bm{x})^{T}(\bm{y}_{j}-\bm{x})]^{2} for i,j=1,\ldots,p. The solution to (4.20) immediately gives us c and \bm{g}; the (symmetric) model Hessian is given by H=\sum_{t=1}^{p}\lambda_{t}(\bm{y}_{t}-\bm{x})(\bm{y}_{t}-\bm{x})^{T}. We note that Q is symmetric positive semidefinite [26, eq. 2.10], and \lambda_{1},\ldots,\lambda_{p} are the Lagrange multipliers associated with the constraints (4.2b).
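A compact numerical sketch of assembling and solving (4.20), assuming NumPy (the sample points and objective are illustrative placeholders):

```python
import numpy as np

def min_frobenius_model(f, x, Y):
    """Solve (4.20) to get the minimum Frobenius norm quadratic model (4.1).

    Y is a (p, n) array of interpolation points. Returns (c, g, H).
    """
    p, n = Y.shape
    D = Y - x                                         # rows are y_t - x
    M = np.hstack([np.ones((p, 1)), D])               # p x (n+1), as in (3.3)
    Q = 0.5 * (D @ D.T) ** 2                          # Q_ij = 0.5 [(y_i-x)^T (y_j-x)]^2
    F = np.block([[Q, M], [M.T, np.zeros((n + 1, n + 1))]])
    rhs = np.concatenate([[f(y) for y in Y], np.zeros(n + 1)])
    sol = np.linalg.solve(F, rhs)
    lam, c, g = sol[:p], sol[p], sol[p + 1:]
    H = sum(l * np.outer(d, d) for l, d in zip(lam, D))   # H = sum_t lambda_t (y_t-x)(y_t-x)^T
    return c, g, H

# Example usage: p = 5 feasible points in [0, 1]^2 (n = 2, so n+2 <= p < (n+1)(n+2)/2).
f = lambda y: np.exp(y[0]) + y[0] * y[1] + y[1] ** 2
x = np.array([0.5, 0.5])
Y = x + 0.2 * np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]])
c, g, H = min_frobenius_model(f, x, Y)
print(max(abs(f(y) - (c + g @ (y - x) + 0.5 * (y - x) @ H @ (y - x))) for y in Y))  # ~0
```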

The Lagrange polynomials associated with our interpolation set are

\ell_{t}(\bm{y}):=c_{t}+\bm{g}_{t}^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}H_{t}(\bm{y}-\bm{x}),\qquad\forall t=1,\ldots,p, (4.21)

where c_{t}, \bm{g}_{t} and H_{t} come from solving (4.2) with (4.2b) replaced by m(\bm{y}_{s})=\delta_{s,t} for s=1,\ldots,p. Equivalently, we have

F\begin{bmatrix}\bm{\lambda}_{t}\\ c_{t}\\ \bm{g}_{t}\end{bmatrix}=\bm{e}_{t},\qquad\forall t=1,\ldots,p, (4.22)

and H_{t}=\sum_{s=1}^{p}[\bm{\lambda}_{t}]_{s}(\bm{y}_{s}-\bm{x})(\bm{y}_{s}-\bm{x})^{T}. This gives

\ell_{t}(\bm{y}) =c_{t}+\bm{g}_{t}^{T}(\bm{y}-\bm{x})+\frac{1}{2}\sum_{s=1}^{p}[\bm{\lambda}_{t}]_{s}\left[(\bm{y}-\bm{x})^{T}(\bm{y}_{s}-\bm{x})\right]^{2}, (4.23)
=\begin{bmatrix}\bm{\lambda}_{t}\\ c_{t}\\ \bm{g}_{t}\end{bmatrix}^{T}\underbrace{\begin{bmatrix}\{\frac{1}{2}\left[(\bm{y}-\bm{x})^{T}(\bm{y}_{s}-\bm{x})\right]^{2}\}_{s=1,\ldots,p}\\ 1\\ \bm{y}-\bm{x}\end{bmatrix}}_{=:\bm{\phi}(\bm{y})}, (4.24)
=\bm{e}_{t}^{T}F^{-1}\bm{\phi}(\bm{y}). (4.25)

Given (4.20) and (4.22), we conclude that

\begin{bmatrix}\lambda_{1}\\ \vdots\\ \lambda_{p}\end{bmatrix}=\sum_{t=1}^{p}f(\bm{y}_{t})\bm{\lambda}_{t},\qquad c=\sum_{t=1}^{p}f(\bm{y}_{t})c_{t},\quad\text{and}\quad\bm{g}=\sum_{t=1}^{p}f(\bm{y}_{t})\bm{g}_{t}, (4.26)

which gives

m(\bm{y})=\sum_{t=1}^{p}f(\bm{y}_{t})\ell_{t}(\bm{y}). (4.27)

We now have our new definition of \Lambda-poisedness:

Definition 4.1.

Given \Lambda\geq 1, the interpolation set \{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset C is \Lambda-poised for minimum Frobenius norm interpolation in B(\bm{x},\Delta)\cap C if F is invertible, and

\max_{t=1,\ldots,p}|\ell_{t}(\bm{y})|\leq\Lambda,\qquad\forall\bm{y}\in C\cap B(\bm{x},\min(\Delta,1)). (4.28)
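As with Definition 3.2, the constant in (4.28) can be estimated numerically, here by evaluating the Lagrange polynomials via (4.25) at sampled feasible points. A rough sketch, assuming NumPy and a box feasible set (the sampling is only a heuristic approximation of the exact maximum):

```python
import numpy as np

def estimate_mfn_poisedness(Y, x, Delta, lower, upper, n_samples=5000, seed=0):
    """Estimate max_t |l_t(y)| over y in C ∩ B(x, min(Delta,1)) for the minimum
    Frobenius norm Lagrange polynomials (4.21), with C = {lower <= y <= upper}."""
    rng = np.random.default_rng(seed)
    p, n = Y.shape
    D = Y - x
    M = np.hstack([np.ones((p, 1)), D])
    Q = 0.5 * (D @ D.T) ** 2
    F = np.block([[Q, M], [M.T, np.zeros((n + 1, n + 1))]])        # matrix from (4.20)
    r = min(Delta, 1.0)
    best = 0.0
    for _ in range(n_samples):
        d = rng.normal(size=n)
        d *= r * rng.uniform() ** (1.0 / n) / np.linalg.norm(d)
        y = np.clip(x + d, lower, upper)                           # feasible, stays in ball
        phi = np.concatenate([0.5 * (D @ (y - x)) ** 2, [1.0], y - x])   # phi(y) in (4.24)
        ell = np.linalg.solve(F, phi)[:p]                          # l_t(y) = e_t^T F^{-1} phi(y)
        best = max(best, np.max(np.abs(ell)))
    return best

# Example usage with the p = 5 points from the previous sketch.
x = np.array([0.5, 0.5])
Y = x + 0.2 * np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]])
print(estimate_mfn_poisedness(Y, x, Delta=0.5, lower=np.zeros(2), upper=np.ones(2)))
```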
Lemma 4.2.

If the set \{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset C is \Lambda-poised for minimum Frobenius norm interpolation, then it is \sqrt{p}\,\Lambda-poised for linear regression.

Proof.

Since F is invertible, the sub-matrix M has full column rank by [4, Theorem 3.3], and so \{\bm{y}_{2}-\bm{y}_{1},\ldots,\bm{y}_{p}-\bm{y}_{1}\} spans \mathbb{R}^{n}. The remainder of this proof is based on the argument in [16, p. 83]. We note that since H=0 is a global minimizer of the objective in (4.2), if f is linear then we have exact interpolation, m=f. Applying this to the functions f(\bm{y})=1 and f(\bm{y})=(\bm{y}-\bm{x})^{T}\bm{e}_{i} for i=1,\ldots,n, we get from (4.27) that

1=\sum_{t=1}^{p}\ell_{t}(\bm{y}),\qquad(\bm{y}-\bm{x})^{T}\bm{e}_{i}=\sum_{t=1}^{p}(\bm{y}_{t}-\bm{x})^{T}\bm{e}_{i}\cdot\ell_{t}(\bm{y}),\quad\forall i=1,\ldots,n. (4.29)

Denoting by \bm{\ell}(\bm{y})\in\mathbb{R}^{p} the vector of all \ell_{1}(\bm{y}),\ldots,\ell_{p}(\bm{y}), these are equivalent to

M^{T}\bm{\ell}(\bm{y})=\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}. (4.30)

Hence \bm{\ell}(\bm{y}) is another solution to the (underdetermined) system (3.15). Since the minimal norm solution to (3.15) was \bm{\alpha}(\bm{y}), we must have \|\bm{\alpha}(\bm{y})\|\leq\|\bm{\ell}(\bm{y})\|. However, from (3.17) we know \bm{\alpha}(\bm{y})=\bm{\ell}^{\text{reg}}(\bm{y}), where \bm{\ell}^{\text{reg}}(\bm{y}) are the Lagrange polynomials associated with linear regression for \{\bm{y}_{1},\ldots,\bm{y}_{p}\}. Thus \|\bm{\ell}^{\text{reg}}(\bm{y})\|\leq\|\bm{\ell}(\bm{y})\|. Now, fixing any \bm{y}\in B(\bm{x},\min(\Delta,1))\cap C we get

\|\bm{\ell}^{\text{reg}}(\bm{y})\|_{\infty}\leq\|\bm{\ell}^{\text{reg}}(\bm{y})\|\leq\|\bm{\ell}(\bm{y})\|\leq\sqrt{p}\,\|\bm{\ell}(\bm{y})\|_{\infty}\leq\sqrt{p}\,\Lambda, (4.31)

and we are done. ∎

We are now in a position to construct our fully linear error bounds. We will begin by proving a bound on the size of the model Hessian, which requires the following technical result.

Lemma 4.3.

Fix t>1 and c_{1},c_{2}\geq 0. If a,b\in\mathbb{R} satisfy |a+b|\leq c_{1} and |ta+b|\leq c_{2} then |a|\leq(c_{1}+c_{2})/(t-1) and |b|\leq(tc_{1}+c_{2})/(t-1).

Proof.

We first prove the bound on |a|. To find a contradiction, first suppose that a>(c_{1}+c_{2})/(t-1), and so a>0. Since a+b\geq-c_{1} we have b\geq-c_{1}-a, which means ta+b\geq(t-1)a-c_{1}>c_{2}, contradicting |ta+b|\leq c_{2}. Instead, suppose that a<-(c_{1}+c_{2})/(t-1), and so a<0. Then a+b\leq c_{1} means b\leq c_{1}-a and so ta+b\leq(t-1)a+c_{1}<-c_{2}, again contradicting |ta+b|\leq c_{2}.

The bound on |b| follows from similar reasoning. First suppose that b>(tc_{1}+c_{2})/(t-1), and so b>0. Since a+b\leq c_{1} we have a\leq c_{1}-b and so ta+b\leq tc_{1}-(t-1)b<-c_{2}, contradicting |ta+b|\leq c_{2}. Instead suppose that b<-(tc_{1}+c_{2})/(t-1), and so b<0. Then since a+b\geq-c_{1} we have a\geq-c_{1}-b and so ta+b\geq-tc_{1}-(t-1)b>c_{2}, again contradicting |ta+b|\leq c_{2}. ∎
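For a concrete illustration (not needed for the argument): taking t=2 and c_{1}=c_{2}=1, the values a=2 and b=-3 satisfy |a+b|=1\leq c_{1} and |2a+b|=1\leq c_{2}, while attaining |a|=(c_{1}+c_{2})/(t-1)=2 and |b|=(tc_{1}+c_{2})/(t-1)=3, so the bounds in Lemma 4.3 cannot be improved in general.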

We can now give our bound on the model Hessian. In the existing (unconstrained) theory, this is presented as a bound on \|H\| [16, Theorem 5.7], but here we only need to consider specific Rayleigh-type quotients.

Lemma 4.4.

Suppose f and C satisfy Assumptions 2.1 and 2.2 respectively. Then if \bm{x}\in C, \{\bm{y}_{1},\ldots,\bm{y}_{p}\} is \Lambda-poised for underdetermined quadratic interpolation in B(\bm{x},\Delta)\cap C and \|\bm{y}_{t}-\bm{x}\|\leq\beta\min(\Delta,1) for all t=1,\ldots,p and some \beta>0, then the model m generated by (4.2) has Hessian H satisfying

\max_{s,t=1,\ldots,p}\frac{|(\bm{y}_{s}-\bm{x})^{T}H(\bm{y}_{t}-\bm{x})|}{\beta^{2}\min(\Delta,1)^{2}}\leq\kappa_{H}:=L_{\nabla f}p\left[8\Lambda\beta^{2}+36\Lambda\beta+58\Lambda+6\right]. (4.32)
Proof.

First, fix u\in\{1,\ldots,p\} and consider the associated Lagrange polynomial

u(𝒚)=cu+𝒈uT(𝒚𝒙)+12(𝒚𝒙)THu(𝒚𝒙).\displaystyle\ell_{u}(\bm{y})=c_{u}+\bm{g}_{u}^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}H_{u}(\bm{y}-\bm{x}). (4.33)

Additionally, fix s,t\in\{1,\ldots,p\}, and we will first provide a bound on

|(𝒚s𝒙)THu(𝒚t𝒙)|.\displaystyle|(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{t}-\bm{x})|. (4.34)

To do this, we consider the value of \ell_{u} at five different points: \bm{x}, \bm{y}_{s}, \bm{y}_{t}, plus

𝒚^s:=𝒙+1β^(𝒚s𝒙),and𝒚^t:=𝒙+1β^(𝒚t𝒙),\displaystyle\hat{\bm{y}}_{s}:=\bm{x}+\frac{1}{\hat{\beta}}(\bm{y}_{s}-\bm{x}),\qquad\text{and}\qquad\hat{\bm{y}}_{t}:=\bm{x}+\frac{1}{\hat{\beta}}(\bm{y}_{t}-\bm{x}), (4.35)

where \hat{\beta}:=\max(\beta,2). Since \|\bm{y}_{s}-\bm{x}\|\leq\beta\min(\Delta,1) we have \|\hat{\bm{y}}_{s}-\bm{x}\|\leq\frac{\beta}{\hat{\beta}}\min(\Delta,1)\leq\min(\Delta,1), and \hat{\bm{y}}_{s}\in C by convexity, since \bm{x},\bm{y}_{s}\in C. Similarly, we also have \hat{\bm{y}}_{t}\in B(\bm{x},\min(\Delta,1))\cap C. For these points, from the definition of \Lambda-poisedness we know |\ell_{u}(\bm{x})|,|\ell_{u}(\hat{\bm{y}}_{s})|,|\ell_{u}(\hat{\bm{y}}_{t})|\leq\Lambda, and by definition of Lagrange polynomials we have \ell_{u}(\bm{y}_{s}),\ell_{u}(\bm{y}_{t})\in\{0,1\} and so |\ell_{u}(\bm{y}_{s})|,|\ell_{u}(\bm{y}_{t})|\leq 1.

From |\ell_{u}(\bm{x})|\leq\Lambda we have |c_{u}|\leq\Lambda, and so |\ell_{u}(\hat{\bm{y}}_{s})|\leq\Lambda implies

Λ\displaystyle\Lambda |cu+𝒈uT(𝒚^s𝒙)+12(𝒚^s𝒙)THu(𝒚^s𝒙)|,\displaystyle\geq\left|c_{u}+\bm{g}_{u}^{T}(\hat{\bm{y}}_{s}-\bm{x})+\frac{1}{2}(\hat{\bm{y}}_{s}-\bm{x})^{T}H_{u}(\hat{\bm{y}}_{s}-\bm{x})\right|, (4.36)
=|cu+1β^𝒈uT(𝒚s𝒙)+12β^2(𝒚s𝒙)THu(𝒚s𝒙)|,\displaystyle=\left|c_{u}+\frac{1}{\hat{\beta}}\bm{g}_{u}^{T}(\bm{y}_{s}-\bm{x})+\frac{1}{2\hat{\beta}^{2}}(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{s}-\bm{x})\right|, (4.37)
1β^2|β^𝒈uT(𝒚s𝒙)+12(𝒚s𝒙)THu(𝒚s𝒙)||cu|,\displaystyle\geq\frac{1}{\hat{\beta}^{2}}\left|\hat{\beta}\bm{g}_{u}^{T}(\bm{y}_{s}-\bm{x})+\frac{1}{2}(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{s}-\bm{x})\right|-|c_{u}|, (4.38)

where the last line follows from the reverse triangle inequality. Together with |\ell_{u}(\bm{y}_{s})|\leq 1, we get

|𝒈uT(𝒚s𝒙)+12(𝒚s𝒙)THu(𝒚s𝒙)|\displaystyle\left|\bm{g}_{u}^{T}(\bm{y}_{s}-\bm{x})+\frac{1}{2}(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{s}-\bm{x})\right| 1+|cu|Λ+1,\displaystyle\leq 1+|c_{u}|\leq\Lambda+1, (4.39)
|β^𝒈uT(𝒚s𝒙)+12(𝒚s𝒙)THu(𝒚s𝒙)|\displaystyle\left|\hat{\beta}\bm{g}_{u}^{T}(\bm{y}_{s}-\bm{x})+\frac{1}{2}(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{s}-\bm{x})\right| 2Λβ^2.\displaystyle\leq 2\Lambda\hat{\beta}^{2}. (4.40)

Since \hat{\beta}>1 by definition, applying Lemma 4.3 we conclude

|𝒈uT(𝒚s𝒙)|\displaystyle\left|\bm{g}_{u}^{T}(\bm{y}_{s}-\bm{x})\right| Λ+1+2Λβ^2β^1,\displaystyle\leq\frac{\Lambda+1+2\Lambda\hat{\beta}^{2}}{\hat{\beta}-1}, (4.41)
12|(𝒚s𝒙)THu(𝒚s𝒙)|\displaystyle\frac{1}{2}\left|(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{s}-\bm{x})\right| β^(Λ+1)+2Λβ^β^1.\displaystyle\leq\frac{\hat{\beta}(\Lambda+1)+2\Lambda\hat{\beta}}{\hat{\beta}-1}. (4.42)

Since s was arbitrary, the same inequalities hold with \bm{y}_{s} replaced by \bm{y}_{t}.

Now, consider the point

𝒚^s,t:=𝒙+12(𝒚^s𝒙)+12(𝒚^t𝒙).\displaystyle\hat{\bm{y}}_{s,t}:=\bm{x}+\frac{1}{2}(\hat{\bm{y}}_{s}-\bm{x})+\frac{1}{2}(\hat{\bm{y}}_{t}-\bm{x}). (4.43)

Since \bm{x},\hat{\bm{y}}_{s},\hat{\bm{y}}_{t}\in C, we have \hat{\bm{y}}_{s,t}\in C, and also

𝒚^s,t𝒙12𝒚^s𝒙+12𝒚^t𝒙min(Δ,1),\displaystyle\|\hat{\bm{y}}_{s,t}-\bm{x}\|\leq\frac{1}{2}\|\hat{\bm{y}}_{s}-\bm{x}\|+\frac{1}{2}\|\hat{\bm{y}}_{t}-\bm{x}\|\leq\min(\Delta,1), (4.44)

and so |\ell_{u}(\hat{\bm{y}}_{s,t})|\leq\Lambda. Written in full, this is

|cu+12β^𝒈uT(𝒚s𝒙)+12β^𝒈uT(𝒚t𝒙)\displaystyle\left|c_{u}+\frac{1}{2\hat{\beta}}\bm{g}_{u}^{T}(\bm{y}_{s}-\bm{x})+\frac{1}{2\hat{\beta}}\bm{g}_{u}^{T}(\bm{y}_{t}-\bm{x})\right.
+12(12β^(𝒚s𝒙)+12β^(𝒚t𝒙))THu(12β^(𝒚s𝒙)+12β^(𝒚t𝒙))|Λ.\displaystyle\qquad\qquad\left.+\frac{1}{2}\left(\frac{1}{2\hat{\beta}}(\bm{y}_{s}-\bm{x})+\frac{1}{2\hat{\beta}}(\bm{y}_{t}-\bm{x})\right)^{T}H_{u}\left(\frac{1}{2\hat{\beta}}(\bm{y}_{s}-\bm{x})+\frac{1}{2\hat{\beta}}(\bm{y}_{t}-\bm{x})\right)\right|\leq\Lambda. (4.45)

That is,

14β^2|(𝒚s𝒙)THu(𝒚t𝒙)|\displaystyle\frac{1}{4\hat{\beta}^{2}}|(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{t}-\bm{x})| Λ+|cu|+12β^(|𝒈uT(𝒚s𝒙)|+|𝒈uT(𝒚t𝒙)|)\displaystyle\leq\Lambda+|c_{u}|+\frac{1}{2\hat{\beta}}\left(|\bm{g}_{u}^{T}(\bm{y}_{s}-\bm{x})|+|\bm{g}_{u}^{T}(\bm{y}_{t}-\bm{x})|\right)
+18β^2(|(𝒚s𝒙)THu(𝒚s𝒙)|+|(𝒚t𝒙)THu(𝒚t𝒙)|).\displaystyle\qquad\qquad+\frac{1}{8\hat{\beta}^{2}}\left(|(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{s}-\bm{x})|+|(\bm{y}_{t}-\bm{x})^{T}H_{u}(\bm{y}_{t}-\bm{x})|\right). (4.46)

Applying |c_{u}|\leq\Lambda, (4.41) and (4.42), we conclude

14β^2|(𝒚s𝒙)THu(𝒚t𝒙)|\displaystyle\frac{1}{4\hat{\beta}^{2}}|(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{t}-\bm{x})| 2Λ+Λ+1+2Λβ^2β^(β^1)+β^(Λ+1)+2Λβ^2β^2(β^1),\displaystyle\leq 2\Lambda+\frac{\Lambda+1+2\Lambda\hat{\beta}^{2}}{\hat{\beta}(\hat{\beta}-1)}+\frac{\hat{\beta}(\Lambda+1)+2\Lambda\hat{\beta}}{2\hat{\beta}^{2}(\hat{\beta}-1)}, (4.47)

or

|(𝒚s𝒙)THu(𝒚t𝒙)|\displaystyle|(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{t}-\bm{x})| κ~,\displaystyle\leq\tilde{\kappa}, (4.48)

for all s,t,u=1,\ldots,p, where

κ~:=8Λβ^2+4Λβ^+4β^+8Λβ^3β^1+2β^(Λ+1)+4Λβ^β^1=8Λβ^2+8Λβ^3+10Λβ^+6β^β^1.\displaystyle\tilde{\kappa}:=8\Lambda\hat{\beta}^{2}+\frac{4\Lambda\hat{\beta}+4\hat{\beta}+8\Lambda\hat{\beta}^{3}}{\hat{\beta}-1}+\frac{2\hat{\beta}(\Lambda+1)+4\Lambda\hat{\beta}}{\hat{\beta}-1}=8\Lambda\hat{\beta}^{2}+\frac{8\Lambda\hat{\beta}^{3}+10\Lambda\hat{\beta}+6\hat{\beta}}{\hat{\beta}-1}. (4.49)

More simply, we have

κ~\displaystyle\tilde{\kappa} =8Λβ^2+8Λ(β^1)3+24Λ(β^1)2+(34Λ+6)(β^1)+18Λ+6β^1,\displaystyle=8\Lambda\hat{\beta}^{2}+\frac{8\Lambda(\hat{\beta}-1)^{3}+24\Lambda(\hat{\beta}-1)^{2}+(34\Lambda+6)(\hat{\beta}-1)+18\Lambda+6}{\hat{\beta}-1}, (4.50)
8Λ(β+2)2+8Λ(β+1)2+24Λ(β+1)+34Λ+6+18Λ+6,\displaystyle\leq 8\Lambda(\beta+2)^{2}+8\Lambda(\beta+1)^{2}+24\Lambda(\beta+1)+34\Lambda+6+18\Lambda+6, (4.51)
=16Λβ2+72Λβ+116Λ+12,\displaystyle=16\Lambda\beta^{2}+72\Lambda\beta+116\Lambda+12, (4.52)

using 2\leq\hat{\beta}=\max(\beta,2)\leq\beta+2 to get the inequality. Now let us turn our attention to the model m (4.2). Note that we may add/subtract a linear function to f(\bm{y}) without changing H (since this just changes c and \bm{g} in (4.2)). So, we consider the model \widetilde{m} generated by interpolation to \widetilde{f}(\bm{y}):=f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x}). From (4.27) we may write

H=H~=u=1pf~(𝒚u)Hu,\displaystyle H=\widetilde{H}=\sum_{u=1}^{p}\widetilde{f}(\bm{y}_{u})H_{u}, (4.53)

where H_{u} are the Hessians of the Lagrange polynomials. From Assumption 2.1, we also have |\widetilde{f}(\bm{y})|\leq\frac{L_{\nabla f}}{2}\|\bm{y}-\bm{x}\|^{2}. Then for any s,t=1,\ldots,p we conclude

|(𝒚s𝒙)TH(𝒚t𝒙)|\displaystyle|(\bm{y}_{s}-\bm{x})^{T}H(\bm{y}_{t}-\bm{x})| u=1p|f~(𝒚u)||(𝒚s𝒙)THu(𝒚t𝒙)|,\displaystyle\leq\sum_{u=1}^{p}|\widetilde{f}(\bm{y}_{u})|\cdot|(\bm{y}_{s}-\bm{x})^{T}H_{u}(\bm{y}_{t}-\bm{x})|, (4.54)
u=1pLf2κ~𝒚u𝒙2,\displaystyle\leq\sum_{u=1}^{p}\frac{L_{\nabla f}}{2}\tilde{\kappa}\|\bm{y}_{u}-\bm{x}\|^{2}, (4.55)
Lf2κ~pβ2min(Δ,1)2.\displaystyle\leq\frac{L_{\nabla f}}{2}\tilde{\kappa}p\beta^{2}\min(\Delta,1)^{2}. (4.56)

and we are done after applying (4.52). ∎

Remark 4.5.

In the unconstrained case, [16, Theorem 5.7] gives the bound

H4(p+1)(n+1)(n+2)2LfΛmax(1,Δmax2),\displaystyle\|H\|\leq 4(p+1)\sqrt{\frac{(n+1)(n+2)}{2}}\>L_{\nabla f}\Lambda\max(1,\Delta_{\max}^{2}), (4.57)

where \Delta_{\max} is an upper bound for \Delta. This result assumes \|\bm{y}_{t}-\bm{x}\|\leq\Delta, and so to match our assumptions it requires \|\bm{y}_{t}-\bm{x}\|\leq\beta\min(\Delta,1) where \beta=\Delta_{\max}. Thus we see that Lemma 4.4 improves on [16, Theorem 5.7] by a factor of \mathcal{O}(n).

Next, we can prove an analogous result to Lemma 3.3.

Lemma 4.6.

Suppose the assumptions of Lemma 4.4 are satisfied. Then the model m generated by (4.2) satisfies

\left|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})\right|\leq\frac{1}{2}p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\min(\Delta,1)^{2}, (4.58)

for all \bm{y}\in B(\bm{x},\min(\Delta,1))\cap C. Furthermore, if \bm{x}\in C, we have

|c-f(\bm{x})|\leq\frac{1}{2}p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\min(\Delta,1)^{2}, (4.59)

and

|(\bm{g}-\nabla f(\bm{x}))^{T}(\bm{y}-\bm{x})|\leq p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\min(\Delta,1)^{2}, (4.60)

for all \bm{y}\in B(\bm{x},\min(\Delta,1))\cap C.

Proof.

Fix \bm{y}\in B(\bm{x},\min(\Delta,1))\cap C. From Lemma 4.2, our interpolation set is \sqrt{p}\,\Lambda-poised for linear regression. Therefore there exist constants \{\alpha_{t}(\bm{y})\}_{t=1}^{p} such that

[1𝒚𝒙]=t=1pαt(𝒚)[1𝒚t𝒙],\displaystyle\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}=\sum_{t=1}^{p}\alpha_{t}(\bm{y})\begin{bmatrix}1\\ \bm{y}_{t}-\bm{x}\end{bmatrix}, (4.61)

where |\alpha_{t}(\bm{y})|=|\ell_{t}^{\text{reg}}(\bm{y})|\leq\sqrt{p}\,\Lambda (provided the minimal norm solution is taken, as in (3.15)). Thus we have

|c+𝒈T(𝒚𝒙)f(𝒙)f(𝒙)T(𝒚𝒙)|\displaystyle\left|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})\right|
=|[cf(𝒙)𝒈f(𝒙)]T[1𝒚𝒙]|,\displaystyle\qquad\qquad=\left|\begin{bmatrix}c-f(\bm{x})\\ \bm{g}-\nabla f(\bm{x})\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}-\bm{x}\end{bmatrix}\right|, (4.62)
t=1ppΛ|[cf(𝒙)𝒈f(𝒙)]T[1𝒚t𝒙]|,\displaystyle\qquad\qquad\leq\sum_{t=1}^{p}\sqrt{p}\>\Lambda\left|\begin{bmatrix}c-f(\bm{x})\\ \bm{g}-\nabla f(\bm{x})\end{bmatrix}^{T}\begin{bmatrix}1\\ \bm{y}_{t}-\bm{x}\end{bmatrix}\right|, (4.63)
=pΛt=1p|m(𝒚t)f(𝒙)f(𝒙)T(𝒚t𝒙)12(𝒚t𝒙)TH(𝒚t𝒙)|,\displaystyle\qquad\qquad=\sqrt{p}\>\Lambda\sum_{t=1}^{p}\left|m(\bm{y}_{t})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{t}-\bm{x})-\frac{1}{2}(\bm{y}_{t}-\bm{x})^{T}H(\bm{y}_{t}-\bm{x})\right|, (4.64)
pΛt=1p[|f(𝒚t)f(𝒙)f(𝒙)T(𝒚t𝒙)|+12|(𝒚t𝒙)TH(𝒚t𝒙)|],\displaystyle\qquad\qquad\leq\sqrt{p}\>\Lambda\sum_{t=1}^{p}\left[\left|f(\bm{y}_{t})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}_{t}-\bm{x})\right|+\frac{1}{2}\left|(\bm{y}_{t}-\bm{x})^{T}H(\bm{y}_{t}-\bm{x})\right|\right], (4.65)
p3/2Λ(Lf2β2min(Δ,1)2+κH2β2min(Δ,1)2),\displaystyle\qquad\qquad\leq p^{3/2}\Lambda\left(\frac{L_{\nabla f}}{2}\beta^{2}\min(\Delta,1)^{2}+\frac{\kappa_{H}}{2}\beta^{2}\min(\Delta,1)^{2}\right), (4.66)

using Assumption 2.1 and Lemma 4.4 in the last line, and we have (4.58). To get (4.59), we just take \bm{y}=\bm{x} in (4.58). Finally, to get (4.60), we use

|(𝒈f(𝒙))T(𝒚𝒙)||c+𝒈T(𝒚𝒙)f(𝒙)f(𝒙)T(𝒚𝒙)|+|cf(𝒙)|,\displaystyle|(\bm{g}-\nabla f(\bm{x}))^{T}(\bm{y}-\bm{x})|\leq|c+\bm{g}^{T}(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|+|c-f(\bm{x})|, (4.67)

together with (4.58) and (4.59). ∎

Finally, we can give our fully linear error bounds for underdetermined quadratic interpolation models.

Theorem 4.7.

Suppose f and C satisfy Assumptions 2.1 and 2.2 respectively. Then if \bm{x}\in C, \{\bm{y}_{1},\ldots,\bm{y}_{p}\} is \Lambda-poised for underdetermined quadratic interpolation in B(\bm{x},\Delta)\cap C and \|\bm{y}_{t}-\bm{x}\|\leq\beta\min(\Delta,1) for all t=1,\ldots,p and some \beta>0, then the model m generated by (4.2) is fully linear in B(\bm{x},\Delta) (in the sense of Definition 2.4) with constants

\kappa_{\textnormal{ef}}=\frac{1}{2}L_{\nabla f}+\frac{3}{2}p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}+\frac{1}{2}p\Lambda^{2}\kappa_{H}\beta^{2},\quad\text{and}\quad\kappa_{\textnormal{eg}}=p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}, (4.68)

in (2.4), where \kappa_{H} is defined in Lemma 4.4.

Proof.

First fix \bm{y}\in B(\bm{x},\Delta)\cap C and we will derive \kappa_{\textnormal{ef}} by considering the cases \Delta\geq 1 and \Delta<1 separately. If \Delta\geq 1 then \hat{\bm{y}}:=\bm{x}+\Delta^{-1}(\bm{y}-\bm{x}) is in B(\bm{x},1)=B(\bm{x},\min(\Delta,1)) and \hat{\bm{y}}\in C since C is convex. We then apply Lemma 4.6 to get

|(𝒚𝒙)T(𝒈f(𝒙))|\displaystyle|(\bm{y}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))| =Δ|(𝒚^𝒙)T(𝒈f(𝒙))|,\displaystyle=\Delta|(\hat{\bm{y}}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))|, (4.69)
p3/2Λ(Lf+κH)β2Δmin(Δ,1)2,\displaystyle\leq p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\Delta\min(\Delta,1)^{2}, (4.70)
p3/2Λ(Lf+κH)β2Δ2,\displaystyle\leq p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\Delta^{2}, (4.71)

where the last inequality follows from \min(\Delta,1)^{2}=1\leq\Delta. Also, from (4.61) we have \hat{\bm{y}}-\bm{x}=\sum_{t=1}^{p}\alpha_{t}(\hat{\bm{y}})(\bm{y}_{t}-\bm{x}) with |\alpha_{t}(\hat{\bm{y}})|\leq\sqrt{p}\,\Lambda, and so from Lemma 4.4,

|(𝒚𝒙)TH(𝒚𝒙)|\displaystyle|(\bm{y}-\bm{x})^{T}H(\bm{y}-\bm{x})| =Δ2|(𝒚^𝒙)TH(𝒚^𝒙)|,\displaystyle=\Delta^{2}|(\hat{\bm{y}}-\bm{x})^{T}H(\hat{\bm{y}}-\bm{x})|, (4.72)
Δ2s,t=1p|αs(𝒚^)αt(𝒚^)(𝒚s𝒙)TH(𝒚t𝒙)|,\displaystyle\leq\Delta^{2}\sum_{s,t=1}^{p}\left|\alpha_{s}(\hat{\bm{y}})\alpha_{t}(\hat{\bm{y}})(\bm{y}_{s}-\bm{x})^{T}H(\bm{y}_{t}-\bm{x})\right|, (4.73)
pΛ2κHβ2Δ2min(Δ,1)2,\displaystyle\leq p\Lambda^{2}\kappa_{H}\beta^{2}\Delta^{2}\min(\Delta,1)^{2}, (4.74)
=pΛ2κHβ2Δ2,\displaystyle=p\Lambda^{2}\kappa_{H}\beta^{2}\Delta^{2}, (4.75)

again using \Delta\geq 1 in the last line.

Instead, if \Delta<1 then \bm{y}\in B(\bm{x},\min(\Delta,1))\cap C and we apply Lemma 4.6 directly to get

|(\bm{y}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))| \leq\frac{1}{2}p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\min(\Delta,1)^{2}, (4.76)
\leq p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\Delta^{2}. (4.77)

Also, using Lemma 4.4 we have

|(𝒚𝒙)TH(𝒚𝒙)|\displaystyle|(\bm{y}-\bm{x})^{T}H(\bm{y}-\bm{x})| s,t=1p|αs(𝒚)αt(𝒚)(𝒚s𝒙)TH(𝒚t𝒙)|,\displaystyle\leq\sum_{s,t=1}^{p}\left|\alpha_{s}(\bm{y})\alpha_{t}(\bm{y})(\bm{y}_{s}-\bm{x})^{T}H(\bm{y}_{t}-\bm{x})\right|, (4.78)
pΛ2κHβ2min(Δ,1)2,\displaystyle\leq p\Lambda^{2}\kappa_{H}\beta^{2}\min(\Delta,1)^{2}, (4.79)
=pΛ2κHβ2Δ2.\displaystyle=p\Lambda^{2}\kappa_{H}\beta^{2}\Delta^{2}. (4.80)

In either case, we use (4.71) and (4.75), or (4.77) and (4.80), with Assumption 2.1 and Lemma 4.6 to get

|f(𝒚)m(𝒚)|\displaystyle|f(\bm{y})-m(\bm{y})| |f(𝒚)f(𝒙)f(𝒙)T(𝒚𝒙)|\displaystyle\leq|f(\bm{y})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|
+|c+𝒈T(𝒚𝒙)+12(𝒚𝒙)TH(𝒚𝒙)f(𝒙)f(𝒙)T(𝒚𝒙)|,\displaystyle\qquad+|c+\bm{g}^{T}(\bm{y}-\bm{x})+\frac{1}{2}(\bm{y}-\bm{x})^{T}H(\bm{y}-\bm{x})-f(\bm{x})-\nabla f(\bm{x})^{T}(\bm{y}-\bm{x})|, (4.81)
12LfΔ2+|cf(𝒙)|+|(𝒚𝒙)T(𝒈f(𝒙))|+12|(𝒚𝒙)TH(𝒚𝒙)|,\displaystyle\leq\frac{1}{2}L_{\nabla f}\Delta^{2}+|c-f(\bm{x})|+|(\bm{y}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))|+\frac{1}{2}|(\bm{y}-\bm{x})^{T}H(\bm{y}-\bm{x})|, (4.82)
12LfΔ2+12p3/2Λ(Lf+κH)β2min(Δ,1)2\displaystyle\leq\frac{1}{2}L_{\nabla f}\Delta^{2}+\frac{1}{2}p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\min(\Delta,1)^{2}
+p3/2Λ(Lf+κH)β2Δ2+12pΛ2κHβ2Δ2,\displaystyle\qquad\qquad\qquad\qquad+p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\Delta^{2}+\frac{1}{2}p\Lambda^{2}\kappa_{H}\beta^{2}\Delta^{2}, (4.83)

and we obtain the stated value of κef\kappa_{\textnormal{ef}} after using min(Δ,1)Δ\min(\Delta,1)\leq\Delta.

To get κeg\kappa_{\textnormal{eg}} we now fix an arbitrary 𝒚~B(𝒙,1)C\widetilde{\bm{y}}\in B(\bm{x},1)\cap C and again consider the cases Δ1\Delta\geq 1 and Δ<1\Delta<1 separately. First, if Δ1\Delta\geq 1, then 𝒚~B(𝒙,min(Δ,1))C\widetilde{\bm{y}}\in B(\bm{x},\min(\Delta,1))\cap C, and applying Lemma 4.6 we get

|(𝒚~𝒙)T(𝒈f(𝒙))|\displaystyle|(\widetilde{\bm{y}}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))| p3/2Λ(Lf+κH)β2min(Δ,1)2,\displaystyle\leq p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\min(\Delta,1)^{2}, (4.84)
p3/2Λ(Lf+κH)β2Δ,\displaystyle\leq p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\Delta, (4.85)

since min(Δ,1)2=1Δ\min(\Delta,1)^{2}=1\leq\Delta. Alternatively, if Δ<1\Delta<1 then the convexity of CC implies that 𝒚^:=𝒙+Δ(𝒚~𝒙)B(𝒙,Δ)C=B(𝒙,min(Δ,1))C\hat{\bm{y}}:=\bm{x}+\Delta(\widetilde{\bm{y}}-\bm{x})\in B(\bm{x},\Delta)\cap C=B(\bm{x},\min(\Delta,1))\cap C. Again we apply Lemma 4.6 and get

|(𝒚~𝒙)T(𝒈f(𝒙))|\displaystyle|(\widetilde{\bm{y}}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))| =Δ1|(𝒚^𝒙)T(𝒈f(𝒙))|,\displaystyle=\Delta^{-1}|(\hat{\bm{y}}-\bm{x})^{T}(\bm{g}-\nabla f(\bm{x}))|, (4.86)
p3/2Λ(Lf+κH)β2Δ1min(Δ,1)2,\displaystyle\leq p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\Delta^{-1}\min(\Delta,1)^{2}, (4.87)
=p3/2Λ(Lf+κH)β2Δ.\displaystyle=p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})\beta^{2}\Delta. (4.88)

The value for κeg\kappa_{\textnormal{eg}} then follows from (4.85) and (4.88). ∎

Remark 4.8.

The fully linear error bounds for the unconstrained case [14, 16] give κef,κeg=𝒪(p3/2Λ(Lf+κH))\kappa_{\textnormal{ef}},\kappa_{\textnormal{eg}}=\mathcal{O}(p^{3/2}\Lambda(L_{\nabla f}+\kappa_{H})), so Theorem 4.7 matches the bound for κeg\kappa_{\textnormal{eg}} up to a factor of β2\beta^{2} (but with a value of κH\kappa_{H} which is 𝒪(n)\mathcal{O}(n) smaller than the unconstrained case; see Remark 4.5). Our value of κef\kappa_{\textnormal{ef}} has an extra term of size 𝒪(pΛ2κHβ2)\mathcal{O}(p\Lambda^{2}\kappa_{H}\beta^{2}) and so may be larger depending on the relative sizes of pp and Λ\Lambda.
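To make the scale of these constants concrete, the following short sketch (in Python; a purely illustrative calculation with arbitrary, hypothetical parameter values, not taken from any experiment in this paper) evaluates the expressions in (4.68):

# Hypothetical values for illustration only.
p, Lam, beta = 12, 2.0, 1.0          # interpolation points, poisedness constant, radius factor
L_grad_f, kappa_H = 1.0, 4.0         # Lipschitz constant of grad f and Hessian bound (Lemma 4.4)

kappa_eg = p**1.5 * Lam * (L_grad_f + kappa_H) * beta**2
kappa_ef = 0.5 * L_grad_f + 1.5 * kappa_eg + 0.5 * p * Lam**2 * kappa_H * beta**2
print(kappa_ef, kappa_eg)            # kappa_ef carries the extra O(p Lambda^2 kappa_H beta^2) term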

5 Constructing 𝚲\bm{\Lambda}-Poised Quadratic Models

To meet all the requirements of Algorithm 1, we require procedures which can verify whether or not a given model is CC-fully linear, and if not, modify it to be CC-fully linear. From Theorem 4.7 it is clear that it suffices to verify/ensure that {𝒚1,,𝒚p}\{\bm{y}_{1},\ldots,\bm{y}_{p}\} is Λ\Lambda-poised for underdetermined quadratic interpolation in B(𝒙,Δ)CB(\bm{x},\Delta)\cap C and 𝒚t𝒙βmin(Δ,1)\|\bm{y}_{t}-\bm{x}\|\leq\beta\min(\Delta,1) for all t=1,,pt=1,\ldots,p.

In the first case, verifying whether or not the interpolation set is Λ\Lambda-poised, we may follow the approach in [22, Section 4.3.1] and simply maximize/minimize each t\ell_{t} in B(𝒙,min(Δ,1))CB(\bm{x},\min(\Delta,1))\cap C. Checking 𝒚t𝒙βmin(Δ,1)\|\bm{y}_{t}-\bm{x}\|\leq\beta\min(\Delta,1) is straightforward. Although maximizing/minimizing t\ell_{t}, a possibly nonconvex quadratic function, over B(𝒙,min(Δ,1))CB(\bm{x},\min(\Delta,1))\cap C may not be straightforward depending on the structure of CC, we only need to determine whether the optimal value is above or below Λ\Lambda, not the exact maximizer/minimizer. Existing solvers, potentially including global optimization solvers, are sufficient for this purpose.
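As a concrete illustration of this check, the sketch below (Python with NumPy/SciPy; a minimal sketch only, in which CC is assumed to be a box and the Lagrange polynomial t\ell_{t} is supplied as a callable; the sample function and data are hypothetical) estimates the maximum of |t||\ell_{t}| over the feasible ball by solving the two smooth subproblems. A local solver is used purely for illustration; in general a global solver, or any method able to certify the comparison with Λ\Lambda, would be needed.

import numpy as np
from scipy.optimize import minimize

def exceeds_Lambda(ell_t, x, Delta, lower, upper, Lambda):
    # Estimate max |ell_t(y)| over B(x, min(Delta,1)) intersected with the box [lower, upper].
    r = min(Delta, 1.0)
    ball = {'type': 'ineq', 'fun': lambda y: r**2 - np.dot(y - x, y - x)}   # ||y - x|| <= r
    best = 0.0
    for sign in (1.0, -1.0):                                                # maximize ell_t and -ell_t
        res = minimize(lambda y: -sign * ell_t(y), x, method='SLSQP',
                       bounds=list(zip(lower, upper)), constraints=[ball])
        best = max(best, abs(ell_t(res.x)))
    return best, best > Lambda

# Hypothetical data: n = 2, C = [0, 2]^2, and a sample quadratic standing in for ell_t.
x = np.array([1.0, 1.0])
ell_t = lambda y: 1.0 + 2.0 * (y[0] - 1.0) - 3.0 * (y[1] - 1.0)**2
print(exceeds_Lambda(ell_t, x, Delta=0.5, lower=np.zeros(2), upper=2.0 * np.ones(2), Lambda=2.0))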

We now consider the situation where {𝒚1,,𝒚p}\{\bm{y}_{1},\ldots,\bm{y}_{p}\} is not Λ\Lambda-poised, and we wish to construct a new interpolation set which is Λ\Lambda-poised. There are two possible scenarios:

  • The associated matrix FF in (4.20) is not invertible; or

  • The matrix FF is invertible but |t(𝒚)|>Λ|\ell_{t}(\bm{y})|>\Lambda for some t=1,,pt=1,\ldots,p and 𝒚B(𝒙,min(Δ,1))C\bm{y}\in B(\bm{x},\min(\Delta,1))\cap C.

We will describe how to handle both of these situations in Section 5.2, but we will first need a technical result about how the matrix FF in (4.20) changes as interpolation points are changed.

5.1 Updating the matrix 𝑭\bm{F}

In this section we analyze changes in the determinant of the matrix FF as interpolation points are updated. These results are based on ideas from [26, Section 4], which considers changes in F1F^{-1} from updating interpolation points, rather than det(F)\det(F).

Lemma 5.1.

If AA is a symmetric, invertible matrix and we form A~\widetilde{A} by changing the tt-th row and column of AA to 𝐯~\widetilde{\bm{v}}, then

det(A~)=[(𝒆tTA1𝒗~)2+(𝒆tTA1𝒆t)𝒗~T(𝒆tA1𝒗~)]det(A).\displaystyle\det(\widetilde{A})=\left[(\bm{e}_{t}^{T}A^{-1}\widetilde{\bm{v}})^{2}+(\bm{e}_{t}^{T}A^{-1}\bm{e}_{t})\widetilde{\bm{v}}^{T}(\bm{e}_{t}-A^{-1}\widetilde{\bm{v}})\right]\det(A). (5.1)
Proof.

Denote the tt-th row and column of AA by 𝒗:=A𝒆t\bm{v}:=A\bm{e}_{t}. We first note that A~\widetilde{A} can be written as a symmetric rank-3 update of AA:

A~=A+(𝒗~𝒗)𝒆tT+𝒆t(𝒗~𝒗)T(v~tvt)𝒆t𝒆tT.\displaystyle\widetilde{A}=A+(\widetilde{\bm{v}}-\bm{v})\bm{e}_{t}^{T}+\bm{e}_{t}(\widetilde{\bm{v}}-\bm{v})^{T}-(\widetilde{v}_{t}-v_{t})\bm{e}_{t}\bm{e}_{t}^{T}. (5.2)

The first two update terms add 𝒗~𝒗\widetilde{\bm{v}}-\bm{v} to the tt-th column and row respectively, and the last term corrects for the double update of At,tA_{t,t} from the first two updates. We can then apply the matrix determinant lemma (that is, det(A+UVT)=det(I+VTA1U)det(A)\det(A+UV^{T})=\det(I+V^{T}A^{-1}U)\det(A) if AA is invertible, e.g. [21, Corollary 18.1.4]) to get

det(A~)det(A)\displaystyle\frac{\det(\widetilde{A})}{\det(A)} =det([1+𝒆tTA1(𝒗~𝒗)𝒆tTA1𝒆t𝒆tTA1𝒆t(𝒗~𝒗)TA1(𝒗~𝒗)1+(𝒗~𝒗)TA1𝒆t(𝒗~𝒗)TA1𝒆t(v~tvt)𝒆tTA1(𝒗~𝒗)(v~tvt)𝒆tTA1𝒆t1(v~tvt)𝒆tTA1𝒆t]),\displaystyle=\det\left(\begin{bmatrix}1+\bm{e}_{t}^{T}A^{-1}(\widetilde{\bm{v}}-\bm{v})&\bm{e}_{t}^{T}A^{-1}\bm{e}_{t}&\bm{e}_{t}^{T}A^{-1}\bm{e}_{t}\\ (\widetilde{\bm{v}}-\bm{v})^{T}A^{-1}(\widetilde{\bm{v}}-\bm{v})&1+(\widetilde{\bm{v}}-\bm{v})^{T}A^{-1}\bm{e}_{t}&(\widetilde{\bm{v}}-\bm{v})^{T}A^{-1}\bm{e}_{t}\\ -(\widetilde{v}_{t}-v_{t})\bm{e}_{t}^{T}A^{-1}(\widetilde{\bm{v}}-\bm{v})&-(\widetilde{v}_{t}-v_{t})\bm{e}_{t}^{T}A^{-1}\bm{e}_{t}&1-(\widetilde{v}_{t}-v_{t})\bm{e}_{t}^{T}A^{-1}\bm{e}_{t}\end{bmatrix}\right), (5.3)
=det([1+τααβ1+ττγτγα1γα]),\displaystyle=\det\left(\begin{bmatrix}1+\tau&\alpha&\alpha\\ \beta&1+\tau&\tau\\ -\gamma\tau&-\gamma\alpha&1-\gamma\alpha\end{bmatrix}\right), (5.4)

where α:=𝒆tTA1𝒆t\alpha:=\bm{e}_{t}^{T}A^{-1}\bm{e}_{t}, β:=(𝒗~𝒗)TA1(𝒗~𝒗)\beta:=(\widetilde{\bm{v}}-\bm{v})^{T}A^{-1}(\widetilde{\bm{v}}-\bm{v}), γ:=v~tvt\gamma:=\widetilde{v}_{t}-v_{t} and τ:=𝒆tTA1(𝒗~𝒗)=(𝒗~𝒗)TA1𝒆t\tau:=\bm{e}_{t}^{T}A^{-1}(\widetilde{\bm{v}}-\bm{v})=(\widetilde{\bm{v}}-\bm{v})^{T}A^{-1}\bm{e}_{t} since A1A^{-1} is symmetric. Directly computing, we get

det(A~)det(A)\displaystyle\frac{\det(\widetilde{A})}{\det(A)} =(1+τ)[(1+τ)(1γα)+τγα]β[α(1γα)+γα2]γτ[ατα(1+τ)],\displaystyle=(1+\tau)\left[(1+\tau)(1-\gamma\alpha)+\tau\gamma\alpha\right]-\beta\left[\alpha(1-\gamma\alpha)+\gamma\alpha^{2}\right]-\gamma\tau\left[\alpha\tau-\alpha(1+\tau)\right], (5.5)
=(1+τ)[1+τγα]βα+γτα,\displaystyle=(1+\tau)\left[1+\tau-\gamma\alpha\right]-\beta\alpha+\gamma\tau\alpha, (5.6)
=(1+τ)2α(γ+β).\displaystyle=(1+\tau)^{2}-\alpha(\gamma+\beta). (5.7)

Finally, since 𝒗=A𝒆t\bm{v}=A\bm{e}_{t}, we get

τ=𝒆tTA1(𝒗~𝒗)=𝒆tTA1𝒗~𝒆tTA1𝒗=𝒆t=𝒆tTA1𝒗~1,\displaystyle\tau=\bm{e}_{t}^{T}A^{-1}(\widetilde{\bm{v}}-\bm{v})=\bm{e}_{t}^{T}A^{-1}\widetilde{\bm{v}}-\bm{e}_{t}^{T}\underbrace{A^{-1}\bm{v}}_{=\bm{e}_{t}}=\bm{e}_{t}^{T}A^{-1}\widetilde{\bm{v}}-1, (5.8)

and

γ+β\displaystyle\gamma+\beta =(𝒗~𝒗)T𝒆t+(𝒗~𝒗)TA1(𝒗~𝒗),\displaystyle=(\widetilde{\bm{v}}-\bm{v})^{T}\bm{e}_{t}+(\widetilde{\bm{v}}-\bm{v})^{T}A^{-1}(\widetilde{\bm{v}}-\bm{v}), (5.9)
=𝒗~T𝒆t𝒗T𝒆t+𝒗~TA1𝒗~2𝒗~TA1𝒗+𝒗TA1𝒗,\displaystyle=\widetilde{\bm{v}}^{T}\bm{e}_{t}-\bm{v}^{T}\bm{e}_{t}+\widetilde{\bm{v}}^{T}A^{-1}\widetilde{\bm{v}}-2\widetilde{\bm{v}}^{T}A^{-1}\bm{v}+\bm{v}^{T}A^{-1}\bm{v}, (5.10)
=𝒗~T𝒆t𝒗T𝒆t+𝒗~TA1𝒗~2𝒗~T𝒆t+𝒗T𝒆t,\displaystyle=\widetilde{\bm{v}}^{T}\bm{e}_{t}-\bm{v}^{T}\bm{e}_{t}+\widetilde{\bm{v}}^{T}A^{-1}\widetilde{\bm{v}}-2\widetilde{\bm{v}}^{T}\bm{e}_{t}+\bm{v}^{T}\bm{e}_{t}, (5.11)
=𝒗~T(A1𝒗~𝒆t).\displaystyle=\widetilde{\bm{v}}^{T}(A^{-1}\widetilde{\bm{v}}-\bm{e}_{t}). (5.12)

Our result then follows from (5.7) combined with the definition of α\alpha, (5.8) and (5.12). ∎
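The identity (5.1) is easy to sanity-check numerically. The following sketch (Python with NumPy; an illustrative verification with randomly generated data, not part of the algorithmic development) replaces the tt-th row and column of a random symmetric matrix and compares both sides:

import numpy as np

rng = np.random.default_rng(0)
m, t = 6, 2                                    # matrix size and the row/column index to replace
A = rng.standard_normal((m, m))
A = A + A.T                                    # symmetric and (generically) invertible
v_tilde = rng.standard_normal(m)               # the new t-th row/column

A_tilde = A.copy()
A_tilde[t, :] = v_tilde                        # replace row t ...
A_tilde[:, t] = v_tilde                        # ... and column t (so A_tilde[t, t] = v_tilde[t])

e_t = np.zeros(m); e_t[t] = 1.0
Ainv = np.linalg.inv(A)
factor = (e_t @ Ainv @ v_tilde)**2 + (e_t @ Ainv @ e_t) * (v_tilde @ (e_t - Ainv @ v_tilde))
print(np.isclose(np.linalg.det(A_tilde), factor * np.linalg.det(A)))   # expected output: True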

Theorem 5.2.

Suppose the interpolation set {𝐲1,,𝐲p}n\{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset\mathbb{R}^{n} is such that the matrix FF in (4.20) is invertible. If the point 𝐲t\bm{y}_{t} for some t{1,,p}t\in\{1,\ldots,p\} is changed to a new point 𝐲n\bm{y}\in\mathbb{R}^{n}, then the new interpolation set yields a new matrix F~\widetilde{F} in (4.20) which satisfies

|det(F~)|t(𝒚)2|det(F)|.\displaystyle|\det(\widetilde{F})|\geq\ell_{t}(\bm{y})^{2}|\det(F)|. (5.13)
Proof.

If t(𝒚)=0\ell_{t}(\bm{y})=0, then the result holds trivially, so we assume without loss of generality that t(𝒚)0\ell_{t}(\bm{y})\neq 0.

Changing the interpolation point 𝒚t\bm{y}_{t} to 𝒚\bm{y} requires replacing the tt-th row and column of FF by 𝒗~\widetilde{\bm{v}}, where

𝒗~s\displaystyle\widetilde{\bm{v}}_{s} =12[(𝒚𝒙)T(𝒚s𝒙)]2,if s{1,,p} except for s=t,\displaystyle=\frac{1}{2}\left[(\bm{y}-\bm{x})^{T}(\bm{y}_{s}-\bm{x})\right]^{2},\qquad\text{if $s\in\{1,\ldots,p\}$ except for $s=t$}, (5.14a)
𝒗~t\displaystyle\widetilde{\bm{v}}_{t} =12[(𝒚𝒙)T(𝒚𝒙)]2,\displaystyle=\frac{1}{2}\left[(\bm{y}-\bm{x})^{T}(\bm{y}-\bm{x})\right]^{2}, (5.14b)
𝒗~p+2\displaystyle\widetilde{\bm{v}}_{p+2} =1,\displaystyle=1, (5.14c)
𝒗~p+2+i\displaystyle\widetilde{\bm{v}}_{p+2+i} =[𝒚𝒙]i,i=1,,n.\displaystyle=[\bm{y}-\bm{x}]_{i},\qquad\forall i=1,\ldots,n. (5.14d)

That is, 𝒗~\widetilde{\bm{v}} is the same as ϕ(𝒚)\bm{\phi}(\bm{y}) from (4.25) except for the tt-th entry. Specifically, we have

𝒗~\displaystyle\widetilde{\bm{v}} =ϕ(𝒚)+ηt𝒆t,\displaystyle=\bm{\phi}(\bm{y})+\eta_{t}\bm{e}_{t}, (5.15)

where

ηt\displaystyle\eta_{t} =12[(𝒚𝒙)T(𝒚𝒙)]212[(𝒚𝒙)T(𝒚t𝒙)]2=12𝒚𝒙4ϕ(𝒚)T𝒆t.\displaystyle=\frac{1}{2}\left[(\bm{y}-\bm{x})^{T}(\bm{y}-\bm{x})\right]^{2}-\frac{1}{2}\left[(\bm{y}-\bm{x})^{T}(\bm{y}_{t}-\bm{x})\right]^{2}=\frac{1}{2}\|\bm{y}-\bm{x}\|^{4}-\bm{\phi}(\bm{y})^{T}\bm{e}_{t}. (5.16)

Given this, Lemma 5.1 gives us (denoting αt:=𝒆tTF1𝒆t\alpha_{t}:=\bm{e}_{t}^{T}F^{-1}\bm{e}_{t} for notational convenience)

det(F~)det(F)\displaystyle\frac{\det(\widetilde{F})}{\det(F)} =(𝒆tTF1ϕ(𝒚)+ηt𝒆tTF1𝒆t)2+(𝒆tTF1𝒆t)(ϕ(𝒚)+ηt𝒆t)T(𝒆tF1(ϕ(𝒚)+ηt𝒆t)),\displaystyle=\left(\bm{e}_{t}^{T}F^{-1}\bm{\phi}(\bm{y})+\eta_{t}\bm{e}_{t}^{T}F^{-1}\bm{e}_{t}\right)^{2}+(\bm{e}_{t}^{T}F^{-1}\bm{e}_{t})(\bm{\phi}(\bm{y})+\eta_{t}\bm{e}_{t})^{T}(\bm{e}_{t}-F^{-1}(\bm{\phi}(\bm{y})+\eta_{t}\bm{e}_{t})), (5.17)
=(t(𝒚)+ηtαt)2+αt(ϕ(𝒚)T𝒆tϕ(𝒚)TF1ϕ(𝒚)ηtt(𝒚)+ηtηtt(𝒚)ηt2αt),\displaystyle=\left(\ell_{t}(\bm{y})+\eta_{t}\alpha_{t}\right)^{2}+\alpha_{t}\left(\bm{\phi}(\bm{y})^{T}\bm{e}_{t}-\bm{\phi}(\bm{y})^{T}F^{-1}\bm{\phi}(\bm{y})-\eta_{t}\ell_{t}(\bm{y})+\eta_{t}-\eta_{t}\ell_{t}(\bm{y})-\eta_{t}^{2}\alpha_{t}\right), (5.18)
=t(𝒚)2+αt(ϕ(𝒚)T𝒆tϕ(𝒚)TF1ϕ(𝒚)+ηt).\displaystyle=\ell_{t}(\bm{y})^{2}+\alpha_{t}\left(\bm{\phi}(\bm{y})^{T}\bm{e}_{t}-\bm{\phi}(\bm{y})^{T}F^{-1}\bm{\phi}(\bm{y})+\eta_{t}\right). (5.19)

So from the definition of ηt\eta_{t}, we get

det(F~)\displaystyle\det(\widetilde{F}) =(t(𝒚)2+αtβt)det(F),\displaystyle=\left(\ell_{t}(\bm{y})^{2}+\alpha_{t}\beta_{t}\right)\>\det(F), (5.20)

where βt:=12𝒚𝒙4ϕ(𝒚)TF1ϕ(𝒚)\beta_{t}:=\frac{1}{2}\|\bm{y}-\bm{x}\|^{4}-\bm{\phi}(\bm{y})^{T}F^{-1}\bm{\phi}(\bm{y}). Finally, [26, Lemma 2] gives us αt,βt0\alpha_{t},\beta_{t}\geq 0 provided t(𝒚)0\ell_{t}(\bm{y})\neq 0, which gives us the result (note that t(𝒚)0\ell_{t}(\bm{y})\neq 0 is not explicitly assumed in the statement of [26, Lemma 2], but is required as the proof relies on [26, Theorem, p. 196]). ∎

Remark 5.3.

Theorem 5.2 does not require any of the points 𝒚1,,𝒚p\bm{y}_{1},\ldots,\bm{y}_{p} or 𝒚\bm{y} to be in CC. We will use this when constructing interpolation sets for which FF is invertible (Algorithm 2).

Remark 5.4.

A similar analysis to the above holds for linear interpolation (p=np=n), where Theorem 5.2 simplifies to |det(F~)|=|t(𝒚)||det(F)||\det(\widetilde{F})|=|\ell_{t}(\bm{y})|\cdot|\det(F)| as shown in [8, Section 4.2]. This allows us to use an approach like Algorithm 3 below to construct Λ\Lambda-poised sets for linear interpolation, a simpler method than the one outlined in [22, Section 4.3].

5.2 Constructing 𝚲\bm{\Lambda}-Poised Sets

We begin by describing how to construct an initial interpolation set, contained in B(𝒙,min(Δ,1))CB(\bm{x},\min(\Delta,1))\cap C, for which FF is invertible. The full process for this is given in Algorithm 2. We first generate an initial set {𝒚1,,𝒚p}\{\bm{y}_{1},\ldots,\bm{y}_{p}\} in B(𝒙,min(Δ,1))B(\bm{x},\min(\Delta,1)), but not necessarily in CC, for which FF is invertible. We can do this using the approach outlined in [29, Section 2], which chooses points of the form 𝒙±min(Δ,1)𝒆t\bm{x}\pm\min(\Delta,1)\bm{e}_{t} or 𝒙±min(Δ,1)𝒆s±min(Δ,1)𝒆t\bm{x}\pm\min(\Delta,1)\bm{e}_{s}\pm\min(\Delta,1)\bm{e}_{t} for some s,t{1,,n}s,t\in\{1,\ldots,n\} in a specific way. We then replace any of these points which are not in CC by new points in B(𝒙,min(Δ,1))CB(\bm{x},\min(\Delta,1))\cap C while maintaining an invertible FF.

1:Trust region center 𝒙C\bm{x}\in C and radius Δ>0\Delta>0, number of interpolation points p{n+2,,(n+1)(n+2)/2}p\in\{n+2,\ldots,(n+1)(n+2)/2\}.
2:Generate an initial collection of pp points {𝒚1,,𝒚p}B(𝒙,min(Δ,1))\{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset B(\bm{x},\min(\Delta,1)) using the method from [29, Section 2].
3:while there exists t{1,,p}t\in\{1,\ldots,p\} with 𝒚tC\bm{y}_{t}\notin C do
4:     Calculate Lagrange polynomial t\ell_{t} and find 𝒚B(𝒙,min(Δ,1))C\bm{y}\in B(\bm{x},\min(\Delta,1))\cap C with t(𝒚)0\ell_{t}(\bm{y})\neq 0.
5:     Replace the interpolation point 𝒚t\bm{y}_{t} by 𝒚\bm{y}.
6:end while
7:return interpolation set {𝒚1,,𝒚p}\{\bm{y}_{1},\ldots,\bm{y}_{p}\}.
Algorithm 2 Construct an underdetermined quadratic interpolation set with invertible FF (4.20).
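For readers who prefer code, the following sketch (Python; a minimal outline only, in which all helper routines are hypothetical placeholders for steps described in the text and in [29, Section 2], not routines provided by this paper) mirrors the loop structure of Algorithm 2:

def construct_invertible_F_set(x, Delta, initial_points, in_C, lagrange_poly, feasible_nonzero):
    # Sketch of Algorithm 2.  The helpers are assumptions supplied by the caller:
    #   initial_points(x, r)          -- the points of [29, Section 2] in B(x, r) (maybe infeasible)
    #   in_C(y)                       -- membership test for the convex set C
    #   lagrange_poly(Y, x, t)        -- the Lagrange polynomial ell_t for the current set Y
    #   feasible_nonzero(ell_t, x, r) -- some y in B(x, r) ∩ C with ell_t(y) != 0
    r = min(Delta, 1.0)
    Y = initial_points(x, r)                          # line 2: F invertible, but Y may leave C
    while any(not in_C(y) for y in Y):                # line 3
        t = next(s for s, y in enumerate(Y) if not in_C(y))
        ell_t = lagrange_poly(Y, x, t)                # line 4
        Y[t] = feasible_nonzero(ell_t, x, r)          # line 5: det(F) stays nonzero (Theorem 5.2)
    return Y                                          # line 7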
Theorem 5.5.

For any 𝐱C\bm{x}\in C, Δ>0\Delta>0 and p{n+2,,(n+1)(n+2)/2}p\in\{n+2,\ldots,(n+1)(n+2)/2\}, the output of Algorithm 2 is an interpolation set {𝐲1,,𝐲p}B(𝐱,min(Δ,1))C\{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset B(\bm{x},\min(\Delta,1))\cap C for which FF (4.20) is invertible.

Proof.

Firstly, Algorithm 2 terminates after at most pp iterations, since an index tt can only ever be selected once (after which the new 𝒚tC\bm{y}_{t}\in C).

Next, every point in the output set was either generated in line 2, in which case it must lie in both B(𝒙,min(Δ,1))B(\bm{x},\min(\Delta,1)) (from the construction in line 2) and CC (since otherwise it would have been replaced in the main loop), or was a point 𝒚B(𝒙,min(Δ,1))C\bm{y}\in B(\bm{x},\min(\Delta,1))\cap C chosen in line 4. Hence {𝒚1,,𝒚p}B(𝒙,min(Δ,1))C\{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset B(\bm{x},\min(\Delta,1))\cap C.

Finally, the initial collection {𝒚1,,𝒚p}\{\bm{y}_{1},\ldots,\bm{y}_{p}\} from [29, Section 2] generates an invertible FF, with an explicit formula for F1F^{-1} given in [29, eqns. 2.11–2.15]. Hence the initial FF satisfies |det(F)|>0|\det(F)|>0. Then, at each iteration of the main loop, we replace 𝒚t\bm{y}_{t} with a point 𝒚\bm{y} for which t(𝒚)0\ell_{t}(\bm{y})\neq 0, and so the new FF also satisfies |det(F)|>0|\det(F)|>0 by Theorem 5.2. Thus the output interpolation set has an invertible FF. ∎

1:Trust region center 𝒙C\bm{x}\in C and radius Δ>0\Delta>0, number of interpolation points p{n+2,,(n+1)(n+2)/2}p\in\{n+2,\ldots,(n+1)(n+2)/2\}, optional initial interpolation set {𝒚1,,𝒚p}C\{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset C, poisedness constant Λ>1\Lambda>1.
2:if initial interpolation set not provided or yields singular FF (4.20) or maxt=1,,p𝒚t𝒙>min(Δ,1)\max_{t=1,\ldots,p}\|\bm{y}_{t}-\bm{x}\|>\min(\Delta,1) then
3:     Set {𝒚1,,𝒚p}\{\bm{y}_{1},\ldots,\bm{y}_{p}\} to be the output of Algorithm 2.
4:end if
5:while max𝒚B(𝒙,min(Δ,1))Cmaxt=1,,p|t(𝒚)|>Λ\max_{\bm{y}\in B(\bm{x},\min(\Delta,1))\cap C}\max_{t=1,\ldots,p}|\ell_{t}(\bm{y})|>\Lambda do
6:     Find t{1,,p}t\in\{1,\ldots,p\} and 𝒚B(𝒙,min(Δ,1))C\bm{y}\in B(\bm{x},\min(\Delta,1))\cap C with |t(𝒚)|>Λ|\ell_{t}(\bm{y})|>\Lambda.
7:     Replace the interpolation point 𝒚t\bm{y}_{t} by 𝒚\bm{y} and recalculate Lagrange polynomials.
8:end while
9:return interpolation set {𝒚1,,𝒚p}\{\bm{y}_{1},\ldots,\bm{y}_{p}\}.
Algorithm 3 Construct a Λ\Lambda-poised underdetermined quadratic interpolation set.

We can now present our algorithm for generating a Λ\Lambda-poised set for minimum Frobenius norm interpolation, given in Algorithm 3. This algorithm is very similar to existing methods for ensuring Λ\Lambda-poisedness in unconstrained linear/quadratic interpolation (e.g. [16, Algorithm 6.3]) or constrained linear interpolation ([22, Algorithm 4.2]). Essentially, we first ensure we have an interpolation set with invertible FF (so we can construct the associated Lagrange polynomials), and then iteratively find indices tt and points 𝒚\bm{y} with |t(𝒚)|>Λ|\ell_{t}(\bm{y})|>\Lambda, replacing 𝒚t\bm{y}_{t} with 𝒚\bm{y}.
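The same structure can be sketched in code (Python; again only an outline with hypothetical helper routines, and with the singular-FF test from line 2 omitted for brevity):

import numpy as np

def make_lambda_poised(x, Delta, Lambda, Y, lagrange_poly, maximize_abs, algorithm2):
    # Sketch of Algorithm 3.  Assumed helpers:
    #   lagrange_poly(Y, x, t)    -- the Lagrange polynomial ell_t for the current set Y
    #   maximize_abs(ell_t, x, r) -- a maximizer y* of |ell_t| over B(x, r) ∩ C and its value
    #   algorithm2()              -- a fresh interpolation set from Algorithm 2
    r = min(Delta, 1.0)
    if Y is None or max(np.linalg.norm(y - x) for y in Y) > r:   # lines 2-4 (singular-F check omitted)
        Y = algorithm2()
    while True:                                                  # line 5
        vals = [maximize_abs(lagrange_poly(Y, x, t), x, r) for t in range(len(Y))]
        t = max(range(len(Y)), key=lambda s: vals[s][1])
        y_star, val = vals[t]
        if val <= Lambda:                                        # Lambda-poised: terminate
            return Y
        Y[t] = y_star                                            # lines 6-7: replace and repeat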

Theorem 5.6.

For any 𝐱C\bm{x}\in C, Δ>0\Delta>0, p{n+2,,(n+1)(n+2)/2}p\in\{n+2,\ldots,(n+1)(n+2)/2\} and Λ>1\Lambda>1, Algorithm 3 produces an interpolation set {𝐲1,,𝐲p}\{\bm{y}_{1},\ldots,\bm{y}_{p}\} which is Λ\Lambda-poised for minimum Frobenius norm interpolation in B(𝐱,Δ)CB(\bm{x},\Delta)\cap C and for which 𝐲t𝐱min(Δ,1)\|\bm{y}_{t}-\bm{x}\|\leq\min(\Delta,1) for all t=1,,pt=1,\ldots,p.

Proof.

If Algorithm 3 terminates, then by the definition of the loop check it must satisfy the requirements of the theorem, so it suffices to show that the ‘while’ loop terminates.

At the start of the ‘while’ loop in line 5, we have that {𝒚1,,𝒚p}B(𝒙,min(Δ,1))C\{\bm{y}_{1},\ldots,\bm{y}_{p}\}\subset B(\bm{x},\min(\Delta,1))\cap C and FF is invertible (either because the ‘if’ statement in line 2 evaluates to false or by Theorem 5.5). Hence at the start of the ‘while’ loop we have |det(F)|>0|\det(F)|>0. Let us call this value d0>0d_{0}>0.

In each iteration of the ‘while’ loop, we increase |det(F)||\det(F)| by a factor of at least Λ2\Lambda^{2} by Theorem 5.2. Hence after kk iterations of the loop, we have |det(F)|Λ2kd0|\det(F)|\geq\Lambda^{2k}d_{0}. However, since by construction our interpolation set is contained in the compact set B(𝒙,min(Δ,1))CB(\bm{x},\min(\Delta,1))\cap C, there is a global upper bound dmaxd_{\max} on |det(F)||\det(F)| (which is a continuous function of the interpolation points). Hence the loop must terminate after at most log(dmax/d0)/(2logΛ)\log(d_{\max}/d_{0})/(2\log\Lambda) iterations. ∎
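To give a sense of scale (with purely illustrative numbers), if Λ=2 and d_max/d_0 = 10^6, this bound guarantees termination after at most log(10^6)/(2 log 2) ≈ 10 replacements.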

Remark 5.7.

In the unconstrained case C=nC=\mathbb{R}^{n}, Theorem 5.6 and in particular its reliance on Theorem 5.2 clarifies the argument behind [16, Theorem 6.6].

Theorems 4.7 (with β=1\beta=1) and 5.6 then give us the following.

Corollary 5.8.

Suppose ff and CC satisfy Assumptions 2.1 and 2.2 respectively. Then for any 𝐱C\bm{x}\in C, Δ>0\Delta>0, p{n+2,,(n+1)(n+2)/2}p\in\{n+2,\ldots,(n+1)(n+2)/2\} and Λ>1\Lambda>1, Algorithm 3 produces an interpolation set {𝐲1,,𝐲p}\{\bm{y}_{1},\ldots,\bm{y}_{p}\} for which the model mm generated by (4.2) is fully linear in B(𝐱,Δ)B(\bm{x},\Delta).

6 Conclusions and Future Work

In [22], a weaker notion of fully linear models was used to construct convergent algorithms for (1.1) (namely Algorithm 1), and it was shown that linear interpolation models constructed only with feasible points can yield such models. Here, we show that Algorithm 1 can also be implemented using underdetermined quadratic interpolation models in the minimum Frobenius sense of [26], a much more useful construction in practice. As an extra benefit, our analysis also establishes the same results for linear regression models, a strict generalization of the analysis in [22] that is independently useful when only noisy objective evaluations are available.

The natural extension of this work is to the construction and analysis of second-order accurate models (“fully quadratic” in the language of [16]) built only using feasible points. A concrete implementation of Algorithm 1 with quadratic interpolation models would also be a useful way to extend the implementation from [22] (which only considers nonlinear least-squares objectives, for which linear interpolation models can be practically useful).

References

  • [1] C. Audet and W. Hare, Derivative-Free and Blackbox Optimization, Springer Series in Operations Research and Financial Engineering, Springer, Cham, Switzerland, 2017.
  • [2] F. Augustin and Y. M. Marzouk, A trust-region method for derivative-free nonlinear constrained stochastic optimization, arXiv preprint arXiv:1703.04156, (2017).
  • [3] A. Beck, First-Order Methods in Optimization, MOS-SIAM Series on Optimization, MOS/SIAM, Philadelphia, 2017.
  • [4] M. Benzi, G. H. Golub, and J. Liesen, Numerical solution of saddle point problems, Acta Numerica, 14 (2005), pp. 1–137.
  • [5] S. C. Billups, J. Larson, and P. Graf, Derivative-free optimization of expensive functions with computational error using weighted regression, SIAM Journal on Optimization, 23 (2013), pp. 27–53.
  • [6] C. Cartis, J. Fiala, B. Marteau, and L. Roberts, Improving the flexibility and robustness of model-based derivative-free optimization solvers, ACM Transactions on Mathematical Software, 45 (2019), pp. 32:1–41.
  • [7] C. Cartis, N. I. M. Gould, and P. L. Toint, An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity, IMA Journal of Numerical Analysis, 32 (2012), pp. 1662–1695.
  • [8] C. Cartis and L. Roberts, A derivative-free Gauss-Newton method, Mathematical Programming Computation, 11 (2019), pp. 631–674.
  • [9] P. Conejo, E. Karas, and L. Pedroso, A trust-region derivative-free algorithm for constrained optimization, Optimization Methods and Software, 30 (2015), pp. 1126–1145.
  • [10] P. Conejo, E. Karas, L. Pedroso, A. Ribeiro, and M. Sachine, Global convergence of trust-region algorithms for convex constrained minimization without derivatives, Applied Mathematics and Computation, 220 (2013), pp. 324–330.
  • [11] A. Conn, K. Scheinberg, and Ph. Toint, A derivative free optimization algorithm in practice, in 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, St. Louis, MO, USA, 1998, American Institute of Aeronautics and Astronautics.
  • [12] A. R. Conn, N. I. M. Gould, and P. L. Toint, Trust-Region Methods, vol. 1 of MPS-SIAM Series on Optimization, MPS/SIAM, Philadelphia, 2000.
  • [13] A. R. Conn, K. Scheinberg, and L. N. Vicente, Geometry of interpolation sets in derivative free optimization, Mathematical Programming, 111 (2007), pp. 141–172.
  • [14]  , Geometry of sample sets in derivative-free optimization: Polynomial regression and underdetermined interpolation, IMA Journal of Numerical Analysis, 28 (2008), pp. 721–748.
  • [15]  , Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points, SIAM Journal on Optimization, 20 (2009), pp. 387–415.
  • [16]  , Introduction to Derivative-Free Optimization, vol. 8 of MPS-SIAM Series on Optimization, MPS/SIAM, Philadelphia, 2009.
  • [17] R. Garmanjani, D. Júdice, and L. N. Vicente, Trust-region methods without using derivatives: Worst case complexity and the nonsmooth case, SIAM Journal on Optimization, 26 (2016), pp. 1987–2011.
  • [18] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, 3rd ed., 1996.
  • [19] S. Gratton, P. L. Toint, and A. Tröltzsch, An active-set trust-region method for derivative-free nonlinear bound-constrained optimization, Optimization Methods and Software, 26 (2011), pp. 873–894.
  • [20] E. A. E. Gumma, M. H. A. Hashim, and M. M. Ali, A derivative-free algorithm for linearly constrained optimization problems, Computational Optimization and Applications, 57 (2014), pp. 599–621.
  • [21] D. A. Harville, Matrix Algebra From a Statistician’s Perspective, Springer, New York, 2008.
  • [22] M. Hough and L. Roberts, Model-based derivative-free methods for convex-constrained optimization, SIAM Journal on Optimization, 32 (2022), pp. 2552–2579.
  • [23] K. Kulshreshtha, B. Jurgelucks, F. Bause, J. Rautenberg, and C. Unverzagt, Increasing the sensitivity of electrical impedance to piezoelectric material parameters with non-uniform electrical excitation, Journal of Sensors and Sensor Systems, 4 (2015), pp. 217–227.
  • [24] J. Larson and S. C. Billups, Stochastic derivative-free optimization using a trust region framework, Computational Optimization and Applications, 64 (2016), pp. 619–645.
  • [25] J. Larson, M. Menickelly, and S. M. Wild, Derivative-free optimization methods, Acta Numerica, 28 (2019), pp. 287–404.
  • [26] M. J. D. Powell, Least Frobenius norm updating of quadratic models that satisfy interpolation conditions, Mathematical Programming, 100 (2004), pp. 183–215.
  • [27]  , On the use of quadratic models in unconstrained minimization without derivatives, Optimization Methods and Software, 19 (2004), pp. 399–411.
  • [28]  , The NEWUOA software for unconstrained optimization without derivatives, in Large-Scale Nonlinear Optimization, P. Pardalos, G. Di Pillo, and M. Roma, eds., vol. 83, Springer US, Boston, MA, 2006, pp. 255–297.
  • [29]  , The BOBYQA algorithm for bound constrained optimization without derivatives, Tech. Rep. DAMTP 2009/NA06, University of Cambridge, 2009.
  • [30] M. Scheuerer and T. M. Hamill, Statistical postprocessing of ensemble precipitation forecasts by fitting censored, shifted gamma distributions, Monthly Weather Review, 143 (2015), pp. 4578 – 4596.
  • [31] S. F. B. Tett, J. M. Gregory, N. Freychet, C. Cartis, M. J. Mineter, and L. Roberts, Does model calibration reduce uncertainty in climate projections?, Journal of Climate, 35 (2022), pp. 2585–2602.
  • [32] L. N. Trefethen, Approximation Theory and Approximation Practice, SIAM, Philadelphia, 2020.
  • [33] G. Ughi, V. Abrol, and J. Tanner, An empirical study of derivative-free-optimization algorithms for targeted black-box attacks in deep neural networks, Optimization and Engineering, 23 (2022), pp. 1319–1346.
  • [34] S. M. Wild, Derivative-Free Optimization Algorithms for Computationally Expensive Functions, PhD thesis, Cornell University, 2009.
  • [35] S. M. Wild, R. G. Regis, and C. A. Shoemaker, ORBIT: Optimization by radial basis function interpolation in trust-regions, SIAM Journal on Scientific Computing, 30 (2008), pp. 3197–3219.
  • [36] M. Q. Xuan and J. Nocedal, A feasible method for constrained derivative-free optimization, arXiv preprint arXiv:2402.11920, (2024).
  • [37] Z. Zhang, Sobolev seminorm of quadratic functions with applications to derivative-free optimization, Mathematical Programming, 146 (2014), pp. 77–96.