
StepDIRECT - A Derivative-Free Optimization Method for Stepwise Functions

Dzung T. Phan (IBM Research, T. J. Watson Research Center, New York, USA; Email: [email protected])    Hongsheng Liu (University of North Carolina, Chapel Hill, NC, USA; Email: [email protected]; work done while an intern at IBM Research)    Lam M. Nguyen (IBM Research, T. J. Watson Research Center, New York, USA; Email: [email protected])    Equal contribution.
Abstract

In this paper, we propose the StepDIRECT algorithm for derivative-free optimization (DFO), in which the black-box objective function has a stepwise landscape. Our framework is based on the well-known DIRECT algorithm. By incorporating a measure of local variability to capture the flat landscape, we provide a new criterion to select the potentially optimal hyper-rectangles. In addition, we introduce a stochastic local search algorithm, performed on the potentially optimal hyper-rectangles, to improve solution quality and convergence speed. Global convergence of the StepDIRECT algorithm is established. Numerical experiments on optimization of random forest models and on hyper-parameter tuning are presented to support the efficacy of our algorithm. The proposed StepDIRECT algorithm shows competitive performance compared with state-of-the-art baseline DFO methods, including the original DIRECT algorithm.

1 Introduction

We introduce an optimization algorithm for solving a class of structured black-box deterministic problems, which often arise in data mining and machine learning

(1.1) min𝐱Ωf(𝐱),\min_{{\bf{x}}\in{\Omega}}f({\bf{x}}),

where Ω{\Omega} is a bounded hyper-rectangle, i.e., Ω={𝐱p:𝐥𝐱𝐮}{\Omega}=\{{\bf{x}}\in\mathbb{R}^{p}:{\bf{l}}\leq{\bf{x}}\leq{\bf{u}}\} for some given bounds 𝐥,𝐮p{\bf{l}},{\bf{u}}\in\mathbb{R}^{p}. We assume that f:pf:\mathbb{R}^{p}\to\mathbb{R} is a stepwise function, whose closed-form formula is unavailable or costly to get and store. A stepwise function is a piecewise constant function over a finite number of disjoint subsets. We will point out two motivating important applications in the next section that fit into this framework: hyper-parameter tuning (HPT) for classification and optimizing tree ensemble regression.

Derivative-free optimization (DFO) has a long history and can be traced back to the deterministic direct-search (DDS) method proposed in [12]. DFO algorithms can be classified into two categories: local and global search methods. Local algorithms focus on techniques that seek a local minimizer. Direct local algorithms find search directions by evaluating the function value directly; examples include the Nelder–Mead algorithm [24] and the generalized pattern-search method [30]. Model-based algorithms construct and optimize a local surrogate model of the objective function to determine the new sample point; for instance, radial basis functions are used in RBFOpt [4] as the surrogate model, while polynomial models are used in [27, 3]. Due to the local flatness and discontinuous structure of the stepwise function in (1.1), a local search algorithm can easily get stuck at a poor local minimizer; we note that (1.1) can have an infinite number of local minimizers in some applications, and gradients are either zero or undefined. Global search algorithms aim at finding a global solution. Methods based on Lipschitzian partitioning techniques for underestimating the objective include the dividing rectangles algorithm (DIRECT) [16] and branch-and-bound search [26]. Stochastic search algorithms such as particle swarm optimization (PSO) [17] and differential evolution (DE) [28] have also been considered. Global model-based approaches optimize a surrogate model, usually constructed over the entire search space, e.g., response surface methods [14]; these methods often require a large number of samples to obtain a high-fidelity surrogate. Recently, Bayesian optimization using Gaussian processes [9, 23] has been widely applied to black-box optimization, especially hyper-parameter tuning for machine learning models.

Due to the excessive number of local minimizers and the computational cost of a function evaluation in many applications, such as the hyper-parameter tuning problem, we focus on a global optimization algorithm that can reduce the function value quickly within a moderate number of iterations. In this work, we propose a global spatial-partitioning algorithm for solving problem (1.1), based on the idea of the well-known DIRECT algorithm [16]. The key ingredient in our algorithm is the selection of hyper-rectangles to partition and to sample new points from. We do not attempt to compute approximate gradients or build a surrogate of the objective function as in many existing methods. The name DIRECT comes from the shortening of the phrase “DIviding RECTangles”, which describes the way the algorithm partitions the feasible domain into a number of hyper-rectangles in order to move towards the optimum. An appealing feature of DIRECT is that it is insensitive to discontinuities and does not rely on gradient estimates. These characteristics make it well suited to the stepwise problem (1.1).

There are two main steps in a DIRECT-type algorithm: 1) selecting a sub-region of \Omega, the so-called potentially optimal hyper-rectangle, in order to sample a new point over that sub-region, and 2) splitting the potentially optimal hyper-rectangle. In the literature, a number of DIRECT variants have been proposed to improve the performance of DIRECT; most of them are devoted to the first step [7, 22, 15]. A limited number of papers work on the second step, for example [10]. Another research direction for speeding up DIRECT is to incorporate a local search strategy into the framework [20]. Most of these modifications target a general objective function and, to the best of our knowledge, very few of them have been successful in exploiting the problem structure or prior knowledge of the objective function [10, 22]. For example, in [10], the authors present an efficient modification of DIRECT to optimize a symmetric function by including an ordered set in the hyper-rectangle dividing step. In [22], the authors assume that the optimal function value is known in the hyper-rectangle selection step. In this paper, we propose a new DIRECT-type algorithm that utilizes the stepwise landscape of the objective function and makes contributions to both steps. Compared with the original DIRECT and its variants, the proposed StepDIRECT differs in the following aspects:

  • We provide a new criterion for selecting potentially optimal hyper-rectangles using the local variability and the best function value.

  • We propose a new stochastic local search algorithm, specially designed for stepwise functions, to explore the solution space more efficiently. As a result, StepDIRECT is a stochastic sampling global optimization algorithm, where incorporating stochasticity can help it to escape from a local minimum.

  • When prior information on the relative importance for decision variables is available, we split the potentially optimal hyper-rectangles along the dimension with higher relative variable importance.

We can prove that the algorithm converges asymptotically to the global optimum under mild conditions. Experimental results reveal that our algorithm performs better than or on par with several state-of-the-art methods in most of the settings studied.

2 Motivating Examples

To illustrate the importance of the proposed algorithm, we now present two concrete examples. Before giving a detailed description of these examples, we state a formal definition of a stepwise function.

Definition 1

A function f:pf:\mathbb{R}^{p}\rightarrow\mathbb{R} is called a stepwise function over Ωp\Omega\subset\mathbb{R}^{p} if there is a partition {Ωi}i=1D\{\Omega_{i}\}_{i=1}^{D} of Ω\Omega and real numbers {ci}i=1D\{c_{i}\}_{i=1}^{D} such that ΩiΩj=\Omega_{i}\cap\Omega_{j}=\emptyset for 1i<jD1\leq i<j\leq D and f(𝐱)=i=1Dci𝐈(𝐱Ωi)f({\bf{x}})=\sum_{i=1}^{D}c_{i}\mathbf{I}{({\bf{x}}\in\Omega_{i})}. Here, 𝐈(){\bf{I}}(\cdot) is the indicator function.
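As a toy illustration of Definition 1 (our own example, not from the paper), the following Python snippet defines a one-dimensional stepwise function with D = 3 pieces; its gradient is zero almost everywhere, which is why gradient-based local methods gain no useful descent information on such objectives.

```python
import numpy as np

def stepwise_f(x):
    """Toy stepwise function on Omega = [0, 1] with D = 3 pieces:
    f = 2.0 on [0, 0.3), 0.5 on [0.3, 0.7), 1.0 on [0.7, 1]."""
    if x < 0.3:
        return 2.0
    elif x < 0.7:
        return 0.5
    return 1.0

# piecewise-constant values; the derivative is 0 wherever it exists
print([stepwise_f(x) for x in np.linspace(0.0, 1.0, 5)])
```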

2.1 Tree-based Ensemble Regression Optimization

A common approach for the decision-making problem based on data-driven tools is to build a pipeline from historical data, to predictive model, to decisions. A two-stage solution is often used, where the prediction and the optimization are carried out in a separate manner [5]. Firstly, a machine learning model is trained to learn the underlying relationship between the controllable variables and the output. Secondly, the trained model is embedded in the downstream optimization to produce a decision. We assume that the regression model estimated from the data is a good representation of the complex relationship. In this paper, we consider the problem of optimizing a tree ensemble model such as random forests and boosting trees, where the predictive model has been learned from historical data.

The tree ensemble model combines predictions from multiple decision trees. A decision tree uses a tree-like structure to predict the outcome for an input feature vector 𝐱p{\bf{x}}\in\mathbb{R}^{p}. The tt-th regression tree in the ensemble model has the following form

(2.2) ft(𝐱)=i=1Mtct,i𝐈(𝐱Ωt,i)f_{t}({\bf{x}})=\sum_{i=1}^{M_{t}}c_{t,i}\cdot\mathbf{I}{({\bf{x}}\in\Omega_{t,i})}

where Ωt,1,,Ωt,Mt\Omega_{t,1},\ldots,\Omega_{t,M_{t}} represent a partition of feature space. We can see that ft(𝐱)f_{t}({\bf{x}}) is a stepwise function. The tree ensemble regression function outputs predictions by taking the weighted sum of multiple decision trees as

(2.3) f(𝐱)=t=1Twtft(𝐱),f({\bf{x}})=\sum_{t=1}^{T}w_{t}f_{t}({\bf{x}}),

where w_{t} is the weight for the decision tree f_{t}({\bf{x}}). We assume that the parameters c_{t,i}, \Omega_{t,i} and w_{t} have been learned from data, and we use StepDIRECT to optimize f({\bf{x}}). Since each f_{t}({\bf{x}}) is constant over each sub-region \Omega_{t,i}, f({\bf{x}}) is a stepwise function. Formally, we have the following theorem.

Theorem 2.1

Regression functions of decision-tree ensemble methods that use a constant prediction value at each leaf node, including random forests and gradient boosting, are stepwise functions.

We note that for this type of regression function, we might obtain additional information about the objective, such as the variable importance of input features for random forests [1].
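To make the black-box setting of this section concrete, below is a minimal sketch (our own illustration, using scikit-learn's built-in diabetes data rather than the paper's exact UCI setup) of how a trained random forest yields the stepwise objective f({\bf{x}}) in (2.3).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Train a tree ensemble; its prediction is piecewise constant in x.
X, y = load_diabetes(return_X_y=True)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Bounds of the box Omega, taken here from the ranges of the training features.
lower, upper = X.min(axis=0), X.max(axis=0)

def objective(x):
    """Black-box objective for (1.1): the ensemble prediction at x."""
    return float(rf.predict(np.asarray(x).reshape(1, -1))[0])

print(objective((lower + upper) / 2.0))  # value at the center of the box
```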

2.2 Hyper-parameter Tuning for Classification

In machine learning models, we often need to choose a number of hyper-parameters to get a high prediction accuracy for unseen samples [13]. For example, in training an 1\ell_{1}-regularized logistic regression classifier, we tune the sparsity parameter [19]. A common practice is to split the entire dataset into three subsets: training data, validation data, and test data [11]. For a given set of hyper-parameters, we train the model on the training data, then evaluate the model performance on the validation data. Some widely-used performance metrics include accuracy, precision, recall, F1F_{1}-score, and AUC (Area Under the Curve). The goal is to tune hyper-parameters for the model to maximize the performance on the validation data. The task can be formulated as a black-box optimization problem.

First, we start with a binary classifier. Suppose that we are given M training samples \{({\bf{u}}_{1},{\bf{v}}_{1}),\ldots,({\bf{u}}_{M},{\bf{v}}_{M})\} and N validation samples \{({\bf{u}}_{1},{\bf{v}}_{1}),\ldots,({\bf{u}}_{N},{\bf{v}}_{N})\}, where {\bf{u}}_{i}\in\mathbb{R}^{d} and {\bf{v}}_{i}\in\{\pm 1\}. For a fixed set of model parameters \bm{\lambda}\in\mathbb{R}^{p}, a classification model h(\cdot;\bm{\lambda}):\mathbb{R}^{d}\to\{\pm 1\} has been learned from the training data. When the performance measure is accuracy, the HPT problem is to determine \bm{\lambda} that maximizes the validation accuracy

Facc(𝝀)=1Ni=1N𝐈(h(𝐮i;𝝀)=𝐯i),F_{acc}(\bm{\lambda})=\frac{1}{N}\sum_{i=1}^{N}{\bf{I}}(h({\bf{u}}_{i};\bm{\lambda})={\bf{v}}_{i}),

where ({\bf{u}}_{i},{\bf{v}}_{i}) is taken from the validation data. We can see that F_{acc}(\bm{\lambda})\in\{0,\frac{1}{N},\ldots,\frac{N-1}{N},1\} for any \bm{\lambda}; hence we expect the landscape of this function to exhibit stepwise behavior. As an example, Figure 1 plots the landscape for HPT of logistic regression when tuning the \ell_{1} parameter (denoted by C on the x-axis) on the a1a dataset [2].

Figure 1: The landscape of the objective function of the HPT problem

The target function for other metrics can also take only a finite number of values. The observation still holds true for a multi-label classifier.
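A minimal sketch of the validation-accuracy objective F_{acc} for an \ell_{1}-regularized logistic regression follows; the synthetic data, split, and solver are our own illustrative choices, not the paper's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

def F_acc(log10_C):
    """Validation accuracy as a function of the l1 parameter (log scale).
    It takes values in {0, 1/N, ..., 1}, hence the stepwise landscape."""
    clf = LogisticRegression(penalty="l1", C=10.0 ** log10_C,
                             solver="liblinear").fit(X_tr, y_tr)
    return float(np.mean(clf.predict(X_val) == y_val))

print([round(F_acc(c), 3) for c in (-2, 0, 2)])
```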

3 StepDIRECT for Stepwise Functions

The StepDIRECT algorithm begins the optimization by transforming the domain of the problem (1.1) linearly into the unit hyper-cube. Therefore, we assume for the rest of the paper that

(3.4) Ω={𝐱p:0xi1}.{\Omega}=\{{\bf{x}}\in\mathbb{R}^{p}:0\leq x_{i}\leq 1\}.

In each iteration, StepDIRECT consists of three main steps. First, we identify a set of potentially optimal hyper-rectangles based on a new criterion; these hyper-rectangles are expected to have a high chance of containing a globally optimal solution. Second, we perform a local search over the potentially optimal hyper-rectangles. Third, we divide the selected hyper-rectangles into smaller hyper-rectangles.

We improve over general DIRECT-type approaches by: 1) using a different heuristic, which takes into account the local function variability, to select which hyper-rectangle to split; and 2) using a local search (randomized directional search) to choose a high-quality point as a representative for each hyper-rectangle. Both proposed strategies exploit the stepwise structure of the objective function.

At the kk-th iteration, let 𝒫k{\cal{P}}_{k} be the set of all generated hyper-rectangles i{\cal{H}}_{i} in the partition of Ω\Omega, where i={𝐱p:𝐥i𝐱𝐮i}{\cal{H}}_{i}=\{{\bf{x}}\in\mathbb{R}^{p}:{\bf{l}}_{i}\leq{\bf{x}}\leq{\bf{u}}_{i}\} for some bounds 𝐥i{\bf{l}}_{i} and 𝐮i{\bf{u}}_{i}. We let fif_{{\cal{H}}_{i}} denote the best function value of ff over i{\cal{H}}_{i} (by evaluating at the sampling points in the sub-region i{\cal{H}}_{i}). fminf_{min} and 𝐱min{\bf{x}}_{min} are the best function value and the best feasible solution over Ω\Omega, respectively. We use mm to count the number of function evaluations and mmaxm^{max} is the maximal number of function evaluations. We present our main algorithm StepDIRECT in Algorithm 1.

Algorithm 1 StepDIRECT
  Define 𝐜1=(0.5,,0.5)p{\bf{c}}_{1}=(0.5,\ldots,0.5)\in\mathbb{R}^{p} and set 𝐱min=𝐜1,fmin=f(𝐜1){\bf{x}}_{min}={\bf{c}}_{1},f_{min}=f({\bf{c}}_{1}), k=m=1k=m=1
  Run the Initialization Step to get 𝒫1,fi for i𝒫1,fmin{\cal{P}}_{1},f_{{\cal{H}}_{i}}\mbox{ for }{\cal{H}}_{i}\in{\cal{P}}_{1},f_{min} and 𝐱min{\bf{x}}_{min} (see Sect. 3.1)
  while mmmaxm\leq m^{max} do
     Identify the set 𝒮{\cal{S}} of all potentially optimal hyper-rectangles in 𝒫k{\cal{P}}_{k} (see Sect. 3.2)
     for j𝒮j\in{\cal{S}} do
        a) Perform a local search over j{\cal{H}}_{j} to get fjf_{{\cal{H}}_{j}} (see Sect. 3.4)
        b) Identify the sides of the rectangle j{\cal{H}}_{j}, divide j{\cal{H}}_{j} into smaller hyper-rectangles along these sides, and update 𝒫k{\cal{P}}_{k} (see Sect. 3.3)
        c) Evaluate ff at centers of new hyper-rectangles, and update fif_{{\cal{H}}_{i}} for every i𝒫k{\cal{H}}_{i}\in{\cal{P}}_{k}.
     end for
     Update m,fminm,f_{min} and xminx_{min} from Steps a and c.
     Update k=k+1k=k+1
  end while

3.1 Initialization Step

We follow the first step of the original DIRECT for initialization [8]. In this step, Algorithm 1 starts with finding f1=f(𝐜1)f_{1}=f({\bf{c}}_{1}) and divides the hyper-cube Ω{\Omega} by evaluating the function values at 2p2p points 𝐜1±δ𝐞i,i{1,,p}{\bf{c}}_{1}\pm\delta{\bf{e}}_{i},i\in\{1,\ldots,p\}, where δ=13\delta=\frac{1}{3} and 𝐞i{\bf{e}}_{i} is the ii-th unit vector. The idea of DIRECT is to select a hyper-rectangle with a small function value in a large search space; hence let us define

si=min{f(𝐜1+δ𝐞i),f(𝐜1δ𝐞i)},i{1,,p}s_{i}=\min\{f({\bf{c}}_{1}+\delta{\bf{e}}_{i}),f({\bf{c}}_{1}-\delta{\bf{e}}_{i})\},\forall i\in\{1,\ldots,p\}

and the dimension with the smallest s_{i} is partitioned into thirds. By doing so, {\bf{c}}_{1}\pm\delta{\bf{e}}_{i} become the centers of the newly generated hyper-rectangles. We initialize the sets {\cal{P}}_{1},{\cal{I}}_{1},{\cal{C}}_{1}, the values f_{{\cal{H}}_{i}} for every {\cal{H}}_{i}\in{\cal{P}}_{1}, and update f_{min}=\min\{f_{{\cal{H}}_{i}}:{\cal{H}}_{i}\in{\cal{P}}_{1}\} and the corresponding {\bf{x}}_{min}.
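A minimal sketch of this initialization step is given below; the return values and the toy objective are our own illustrative choices.

```python
import numpy as np

def initialize(f, p, delta=1.0 / 3.0):
    """Evaluate f at the center c1 and at c1 +/- delta*e_i, then pick the
    dimension with the smallest s_i = min(f(c1+delta*e_i), f(c1-delta*e_i))
    to be trisected first (Sect. 3.1)."""
    c1 = np.full(p, 0.5)
    vals = [f(c1)]
    s = np.empty(p)
    for i in range(p):
        e = np.zeros(p)
        e[i] = delta
        f_plus, f_minus = f(c1 + e), f(c1 - e)
        vals += [f_plus, f_minus]
        s[i] = min(f_plus, f_minus)
    i_star = int(np.argmin(s))      # coordinate to split into thirds first
    return c1, i_star, min(vals)    # center, split coordinate, current f_min

# toy stepwise objective in p = 2 dimensions
f = lambda x: float(np.floor(3 * x[0]) - np.floor(2 * x[1]))
print(initialize(f, p=2))
```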

3.2 Potentially Optimal Hyper-rectangles

In this subsection, we propose a new criterion for StepDIRECT to select the potentially optimal hyper-rectangles, i.e., those to be divided in the current iteration. In the original DIRECT algorithm, every hyper-rectangle {\cal{H}}_{i} is represented by a pair (f_{{\cal{H}}_{i}},d_{i}), where f_{{\cal{H}}_{i}} is the function value evaluated at the center {\bf{c}}_{i} of {\cal{H}}_{i} and d_{i} is the distance from the center of hyper-rectangle {\cal{H}}_{i} to its vertices. The criterion to select hyper-rectangles for further division, i.e., the potentially optimal hyper-rectangles, is based on a score computed from (f_{{\cal{H}}_{i}},d_{i}). A pure local strategy would select the hyper-rectangle with the smallest value of f_{{\cal{H}}_{i}}, while a pure global strategy would choose one of the hyper-rectangles with the largest value of d_{i}. The main idea of the DIRECT algorithm is to balance the local and global search, which can be achieved by a score weighting the two strategies: f_{{\cal{H}}_{i}}-Kd_{i} for some K>0. The original definition of potentially optimal hyper-rectangles {\cal{H}}_{j} for DIRECT [16] is based on two conditions:

(3.5) f(𝐜j)Kdj\displaystyle f({\bf{c}}_{j})-Kd_{j}\; f(𝐜i)Kdi,i𝒫k\displaystyle\leq\;f({\bf{c}}_{i})-Kd_{i},\;\;\forall{\cal{H}}_{i}\in{\cal{P}}_{k}
(3.6) f(𝐜j)Kdj\displaystyle f({\bf{c}}_{j})-Kd_{j}\; fminϵ,\displaystyle\leq\;f_{min}-\epsilon,

for some ϵ>0\epsilon>0.

An example of identifying these potentially optimal hyper-rectangles by the original DIRECT is given in Figure 2. Each dot in a two-dimensional space represents a hyper-rectangle; the three red dots, which attain the smallest value of f({\bf{c}}_{j})-Kd_{j} for some K and yield a significant improvement (i.e., f({\bf{c}}_{j})-Kd_{j}\leq f_{min}-\epsilon), are considered potentially optimal.

Figure 2: Identifying potentially optimal hyper-rectangles for DIRECT

The proposed StepDIRECT searches locally and globally by dividing all hyper-rectangles that meet the criteria in Definition 2. A hyper-rectangle {\cal{H}}_{i} is now represented by a triple (f_{{\cal{H}}_{i}},d_{i},\sigma_{i}), where f_{{\cal{H}}_{i}} and \sigma_{i} carry new meanings: f_{{\cal{H}}_{i}} is a higher-quality value returned by a local search, serving as a representative for the hyper-rectangle, and \sigma_{i} is a flatness measure used in the proposed criterion.

Definition 2

(potentially optimal hyper-rectangle) Let ϵ>0\epsilon>0 be a small positive constant and fminf_{min} be the current best function value over Ω\Omega. Suppose fif_{{\cal{H}}_{i}} is the best function value over the hyper-rectangle i{\cal{H}}_{i}. A hyper-rectangle j{\cal{H}}_{j} is said to be potentially optimal if there exists K>0K>0 such that

(3.7) fjKdjσj\displaystyle f_{{\cal{H}}_{j}}-Kd_{j}\sigma_{j}\; fiKdiσi,i𝒫k\displaystyle\leq\;f_{{\cal{H}}_{i}}-Kd_{i}\sigma_{i},\quad\forall{\cal{H}}_{i}\in{\cal{P}}_{k}
(3.8) fjKdjσj\displaystyle f_{{\cal{H}}_{j}}-Kd_{j}\sigma_{j}\; fminϵ|fminfmedian|,\displaystyle\leq\;f_{min}-\epsilon|f_{min}-f_{median}|,

where σj>0\sigma_{j}>0 quantifies the local variability of ff on the hyper-rectangle j{\cal{H}}_{j} (defined in (3.9)) and fmedianf_{median} is the median of all function values in history.

In Eq. (3.7), we introduce the new quantity \sigma_{j} to measure the local landscape of the stepwise function. Furthermore, we replace f({\bf{c}}_{j}) from the original DIRECT by f_{{\cal{H}}_{j}}, which is computed with the help of local search. The value f_{{\cal{H}}_{j}} for {\cal{H}}_{j}, obtained during Steps a and c, gives a better estimate of the minimum of f over {\cal{H}}_{j} because f_{{\cal{H}}_{j}}\leq f({\bf{c}}_{j}). In Eq. (3.8), in order to eliminate sensitivity to scaling, |f_{min}| is replaced by the difference |f_{min}-f_{median}| as suggested in [7].

Now we discuss how to compute σj\sigma_{j}. At the kk-th iteration of the StepDIRECT algorithm, for hyper-rectangle j𝒫k{\cal{H}}_{j}\in{\cal{P}}_{k} with center 𝐜j{\bf{c}}_{j} and diameter 2dj2d_{j}, we define its set of neighborhood hyper-rectangles as

𝒩j={i𝒫k:𝐜i𝐜jλdj}{\cal{N}}_{j}=\{{\cal{H}}_{i}\in\mathcal{P}_{k}:\left\|{\bf{c}}_{i}-{\bf{c}}_{j}\right\|\leq\lambda d_{j}\}

for some λ>0\lambda>0. Then, 𝒩j{\cal{N}}_{j} can be further divided into two disjoint subsets 𝒩jD{\cal{N}}_{j}^{D} and 𝒩jE{\cal{N}}_{j}^{E} such that

𝒩jD={i𝒩j:fifj}and\displaystyle{\cal{N}}_{j}^{D}=\{{\cal{H}}_{i}\in{\cal{N}}_{j}:f_{{\cal{H}}_{i}}\neq f_{{\cal{H}}_{j}}\}\ \text{and}
𝒩jE={i𝒩j:fi=fj}.\displaystyle{\cal{N}}_{j}^{E}=\{{\cal{H}}_{i}\in{\cal{N}}_{j}:f_{{\cal{H}}_{i}}=f_{{\cal{H}}_{j}}\}.

The local variability estimator for hyper-rectangle j{\cal{H}}_{j} is defined as

(3.9) σj=max{|𝒩jD||𝒩j|,ϵσ}[ϵσ,1],\sigma_{j}=\max\Big{\{}\dfrac{|{\cal{N}}_{j}^{D}|}{|{\cal{N}}_{j}|},\epsilon_{\sigma}\Big{\}}\in[\epsilon_{\sigma},1],

where 0<\epsilon_{\sigma}<1 is a small positive number to prevent \sigma_{j}=0 when {\cal{N}}_{j}^{D}=\emptyset. In our experiments, we set \lambda=2 and \epsilon_{\sigma}=10^{-8} as defaults. The meaning of \sigma_{j} can be interpreted as follows: for a hyper-rectangle {\cal{H}}_{j}, a large value of \sigma_{j} indicates that there are likely to be more distinct function values in the neighborhood of the center {\bf{c}}_{j}, which calls for more local exploration; we therefore favor including such a hyper-rectangle in the set of potentially optimal hyper-rectangles.
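To make (3.9) concrete, here is a minimal Python sketch of the local-variability estimator; the bookkeeping (each hyper-rectangle stored as a dict with center `c`, center-to-vertex distance `d`, and best value `f`) is our own choice, not the paper's implementation.

```python
import numpy as np

def local_variability(j, rects, lam=2.0, eps_sigma=1e-8):
    """Compute sigma_j = max(|N_j^D| / |N_j|, eps_sigma) as in (3.9)."""
    cj, dj, fj = rects[j]["c"], rects[j]["d"], rects[j]["f"]
    # N_j: hyper-rectangles whose centers lie within lam * d_j of c_j
    neigh = [r for r in rects if np.linalg.norm(r["c"] - cj) <= lam * dj]
    n_diff = sum(r["f"] != fj for r in neigh)   # |N_j^D|: differing best values
    return max(n_diff / len(neigh), eps_sigma)

rects = [{"c": np.array([0.5, 0.5]), "d": 0.5, "f": 1.0},
         {"c": np.array([0.4, 0.5]), "d": 0.2, "f": 0.5},
         {"c": np.array([0.9, 0.9]), "d": 0.2, "f": 1.0}]
print(local_variability(0, rects))   # 1/3: one of three neighbors differs
```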

By Definition 2, we can efficiently find potentially optimal hyper-rectangles based on the following lemma.

Lemma 3.1

Let \epsilon>0 be the positive constant used in Definition 2 and f_{min} be the current best function value. Let \mathcal{I} be the set of indices of all existing hyper-rectangles. For each j\in\mathcal{I}, let \mathcal{I}_{l}=\{i\in\mathcal{I}:d_{i}\sigma_{i}<d_{j}\sigma_{j}\}, \mathcal{I}_{b}=\{i\in\mathcal{I}:d_{i}\sigma_{i}>d_{j}\sigma_{j}\}, \mathcal{I}_{e}=\{i\in\mathcal{I}:d_{i}\sigma_{i}=d_{j}\sigma_{j}\}, and g_{i}=\frac{f_{{\cal{H}}_{i}}-f_{{\cal{H}}_{j}}}{d_{i}\sigma_{i}-d_{j}\sigma_{j}} for all i with d_{i}\sigma_{i}\neq d_{j}\sigma_{j}. The hyper-rectangle {\cal{H}}_{j} is potentially optimal if and only if the following three conditions hold:

  • (a)

    fjfif_{{\cal{H}}_{j}}\leq f_{{\cal{H}}_{i}} for every ie;i\in\mathcal{I}_{e};

  • (b)

    maxilgiminibgi;\max_{i\in\mathcal{I}_{l}}g_{i}\leq\min_{i\in\mathcal{I}_{b}}g_{i};

  • (c)

    If fmedian>fminf_{median}>f_{min} then

    (3.10) ϵfminfj|fminfmedian|+djσjminibgi|fminfmedian|;\displaystyle\epsilon\leq\frac{f_{min}-f_{{\cal{H}}_{j}}}{|f_{min}-f_{median}|}+\frac{d_{j}\sigma_{j}\min_{i\in\mathcal{I}_{b}}g_{i}}{|f_{min}-f_{median}|};

    otherwise,

    (3.11) fjdjσjminibgi+fmin.f_{{\cal{H}}_{j}}\leq d_{j}\sigma_{j}\min_{i\in\mathcal{I}_{b}}g_{i}+f_{min}.

We note that all proofs are deferred to the supplementary material. Definition 2 differs from the potentially optimal hyper-rectangle definition proposed in [16] (i.e., (3.5)–(3.6)) in the following three aspects:

  • (3.7) and (3.8) include the local variability of ff on the hyper-rectangle. For the stepwise function, this quantity helps balance the local exploration and global search.

  • (3.8) uses |fminfmedian||f_{min}-f_{median}| to remove sensitivity to the linear and additive scaling and improve clustering of sample points near optimal solutions.

  • fif_{{\cal{H}}_{i}} can be different from f(𝐜i)f({\bf{c}}_{i}), i.e., fif(𝐜i)f_{{\cal{H}}_{i}}\leq f({\bf{c}}_{i}).
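For illustration, a minimal sketch of the selection test in Lemma 3.1 is given below; the flat arrays `f` (best values f_{{\cal{H}}_{i}}) and `ds` (products d_i·σ_i) are our own data layout, not the paper's implementation.

```python
import numpy as np

def is_potentially_optimal(j, f, ds, f_min, f_median, eps=1e-4):
    """Check conditions (a)-(c) of Lemma 3.1 for hyper-rectangle j,
    where f[i] = f_{H_i} and ds[i] = d_i * sigma_i."""
    I_l = [i for i in range(len(f)) if ds[i] < ds[j]]
    I_b = [i for i in range(len(f)) if ds[i] > ds[j]]
    I_e = [i for i in range(len(f)) if ds[i] == ds[j] and i != j]
    g = lambda i: (f[i] - f[j]) / (ds[i] - ds[j])

    if any(f[j] > f[i] for i in I_e):                     # condition (a)
        return False
    g_min_b = min((g(i) for i in I_b), default=np.inf)
    if I_l and max(g(i) for i in I_l) > g_min_b:          # condition (b)
        return False
    if f_median > f_min:                                  # condition (c)
        return eps <= (f_min - f[j] + ds[j] * g_min_b) / abs(f_min - f_median)
    return f[j] <= ds[j] * g_min_b + f_min

# toy example: three hyper-rectangles; the third is dominated
f, ds = [1.0, 0.8, 0.95], [0.50, 0.20, 0.35]
print([is_potentially_optimal(j, f, ds, f_min=0.8, f_median=0.95)
       for j in range(3)])   # -> [True, True, False]
```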

3.3 Dividing Potentially Optimal Hyper-rectangles

Once a hyper-rectangle has been identified as potentially optimal, StepDIRECT divides it into smaller hyper-rectangles. We take into account the side length and the variable importance measure of each dimension.

In tree-ensemble regression optimization, we can obtain variable importance scores, which indicate the relative importance of different features [1]. In general, a high variable importance suggests both a significant change in function value when moving along that direction and a larger number of discontinuity points along that coordinate. As a result, we tend to make more splits along directions with higher variable importance. Formally, let {\bf{w}}=(w_{1},\ldots,w_{p})\in\mathbb{R}_{+}^{p} be the normalized variable importance with \|{\bf{w}}\|_{1}=1 and let the side lengths of the hyper-rectangle be {\bf{l}}=(l_{1},\ldots,l_{p})\in\mathbb{R}_{+}^{p}; then we define

(3.12) 𝐯=(v1,,vp)=(w1l1,,wplp){\bf{v}}=(v_{1},\ldots,v_{p})=(w_{1}l_{1},\ldots,w_{p}l_{p})

as the relative importance of all coordinates for the selected hyper-rectangle. We choose the coordinate with the highest value viv_{i} to divide the hyper-rectangle into thirds.

If no variable importance measure is provided, we take {\bf{w}}=(1/p,\ldots,1/p)\in\mathbb{R}_{+}^{p}; the dividing procedure then coincides with that of the original DIRECT, which splits along the dimensions with the largest side length.
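A minimal sketch of the coordinate-selection rule (3.12) follows; the surrounding trisection bookkeeping is omitted, and the example importance weights are our own.

```python
import numpy as np

def split_coordinate(lengths, importance=None):
    """Pick the coordinate maximizing v_i = w_i * l_i (Eq. 3.12). With no
    variable importance, w = (1/p, ..., 1/p) and the rule reduces to
    splitting along the longest side, as in the original DIRECT."""
    lengths = np.asarray(lengths, dtype=float)
    p = lengths.size
    if importance is None:
        w = np.full(p, 1.0 / p)
    else:
        w = np.asarray(importance, dtype=float) / np.sum(importance)
    return int(np.argmax(w * lengths))

print(split_coordinate([1/3, 1/9, 1/3]))                        # 0: longest side
print(split_coordinate([1/3, 1/9, 1/3], importance=[1, 1, 4]))  # 2: importance wins
```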

3.4 Local Search for Stepwise Function

In this subsection, we introduce a randomized local search algorithm designed for minimizing stepwise function f(𝐱)f({\bf{x}}) over a bounded box

={𝐱p:𝐚𝐱𝐛}.{\cal{B}}=\{{\bf{x}}\in\mathbb{R}^{p}:{\bf{a}}\leq{\bf{x}}\leq{\bf{b}}\}.

It is known that DIRECT-type algorithms are good at locating promising regions of the feasible space, and that a good local search procedure can help them converge quickly to the global optimum [21, 20]. Since the function values in the neighborhood of a point are almost surely constant for a stepwise function, existing local searches based on a local approximation of the objective might fail to move to a better feasible solution. Hence, we propose a novel local search method, shown in Algorithm 2. Different from classical trust-region methods, Algorithm 2 increases the stepsize when no better solution is discovered in the current iteration, i.e., \tau>1. This change is motivated by the stepwise landscape of the objective function.

Algorithm 2 Local_Search (){\cal{B}})
  input {\cal{B}}
  Given starting point 𝐱{\bf{x}}\in{\cal{B}}, τ>1,0<δmin<δmax,δ[δmin,δmax]\tau>1,0<\delta_{min}<\delta_{max},\delta\in[\delta_{min},\delta_{max}], tmax>0t^{max}>0.
  Initialize t=0t=0, fmin=f(𝐱)f^{{\cal{B}}}_{min}=f({\bf{x}}), and 𝐱min=𝐱{\bf{x}}^{{\cal{B}}}_{min}={\bf{x}}.
  while t<tmaxt<t^{max} do
     - Compute fcur=f(𝐱)f_{cur}=f({\bf{x}})
     - Randomly generate search directions 𝒟\mathcal{D}
     - Define S=argmin𝐝𝒟{f(𝐱+δ𝐝):𝐱+δ𝐝}S=\mathop{\rm argmin}_{{\bf{d}}\in\mathcal{D}}\{f({\bf{x}}+\delta{\bf{d}}):{\bf{x}}+\delta{\bf{d}}\in{\cal{B}}\}
     - Randomly sample 𝐝S{\bf{d}}^{*}\in S, define f=f(𝐱+δ𝐝)f^{*}=f({\bf{x}}+\delta{\bf{d}}^{*})
     if f<fminf^{*}<f^{{\cal{B}}}_{min} then
        fminf,𝐱min𝐱+δ𝐝f^{{\cal{B}}}_{min}\leftarrow f^{*},{\bf{x}}_{min}^{{\cal{B}}}\leftarrow{\bf{x}}+\delta{\bf{d}}^{*}
     end if
     if f>fcurf^{*}>f_{cur} then
        δmin{τδ,δmax}\delta\leftarrow\min\{\tau\delta,\delta_{max}\}
     else if f<fcurf^{*}<f_{cur} then
        δmax{δ/τ,δmin}\delta\leftarrow\max\{\delta/\tau,\delta_{min}\}
     end if
     𝐱𝐱+δ𝐝{\bf{x}}\leftarrow{\bf{x}}+\delta{\bf{d}}^{*}
     tt+|𝒟|+1t\leftarrow t+|{\cal{D}}|+1
  end while
  return fmin,𝐱minf^{{\cal{B}}}_{min},{\bf{x}}^{{\cal{B}}}_{min}.

We provide two different options to generate the search directions 𝒟\mathcal{D}:

  • (a)

    By coordinate strategy, for each axis ii, we take two directions 𝐞i{\bf{e}}_{i} and 𝐞i-{\bf{e}}_{i} with a probability wiw_{i} calculated from variable importance.

  • (b)

    By random sampling from the unit sphere in p\mathbb{R}^{p}.

The first option is more suitable when the discontinuity boundaries are parallel to the axes, as for ensemble-tree regression functions. The second strategy works for general stepwise functions whose level-set boundaries are not axis-parallel. We store the feasible sampled points {\bf{x}}+\delta{\bf{d}} with their function values generated during the local search in order to update the best function value f_{{\cal{H}}_{i}} of each hyper-rectangle {\cal{H}}_{i} in a new partition.
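Below is a Python sketch of Algorithm 2 using direction option (b) (random unit-sphere sampling); the parameter defaults mirror Sect. 4.1, while the tie-breaking and the handling of an empty feasible candidate set are our own simplifications.

```python
import numpy as np

def local_search(f, x0, a, b, t_max=30, n_dirs=5, tau=1.5,
                 delta=1.0, delta_min=1e-3, delta_max=2.5, seed=0):
    """Randomized local search for a stepwise f over the box [a, b] (Alg. 2).
    Unlike classical trust-region methods, the step size is *increased*
    (by tau) when the sampled point is worse, to jump across flat plateaus."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    f_min, x_min = f(x), x.copy()
    t = 0
    while t < t_max:
        f_cur = f(x)
        D = rng.normal(size=(n_dirs, x.size))
        D /= np.linalg.norm(D, axis=1, keepdims=True)    # unit-sphere directions
        cand = [x + delta * d for d in D]
        feas = [c for c in cand if np.all(c >= a) and np.all(c <= b)]
        t += n_dirs + 1
        if not feas:                       # all trial points left the box
            delta = max(delta / tau, delta_min)
            continue
        vals = [f(c) for c in feas]
        i = int(np.argmin(vals))           # ties broken arbitrarily
        f_star, x_star = vals[i], feas[i]
        if f_star < f_min:
            f_min, x_min = f_star, x_star.copy()
        if f_star > f_cur:
            delta = min(tau * delta, delta_max)   # no progress: enlarge step
        elif f_star < f_cur:
            delta = max(delta / tau, delta_min)   # progress: shrink step
        x = x_star
    return f_min, x_min

# usage on a toy 2-D stepwise function over [0, 1]^2
f = lambda x: float(np.floor(4 * x[0]) + np.floor(4 * x[1]))
print(local_search(f, x0=[0.9, 0.9], a=np.zeros(2), b=np.ones(2)))
```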

3.5 Global Convergence of StepDIRECT

Now we are able to provide the global convergence result for StepDIRECT. We assume {\bf{w}}=(1/p,\ldots,1/p)\in\mathbb{R}^{p} for simplicity; the following theorem remains valid as long as w_{j}>0 for every j\in\{1,\ldots,p\}.

Theorem 3.1

Suppose that 𝐰=(1/p,,1/p)p{\bf{w}}=(1/p,\ldots,1/p)\in\mathbb{R}^{p} and ff is continuous in a neighborhood of a global optimum. Then, StepDIRECT converges to the globally optimal function value for the stepwise function ff over the bounded box defined in (1.1).

4 Numerical Experiments

In this section, we test the performance of the StepDIRECT algorithm on two problems: optimization of random forest regression functions and hyper-parameter tuning for classification. As explained in Section 2, in both cases we need to minimize a stepwise target function. We denote by StepDIRECT-0 the variant of StepDIRECT in which Local_Search in Step a of Algorithm 1 is skipped.

4.1 Optimization for Random Forest Regression Function

We consider the minimization for Random Forest regression function over a bounded box constraint. We used the boston, diabetes, mpg and bodyfat data sets from the UCI Machine Learning Repository [6]. Table 1 provides the details of these four data sets.

Table 1: Dataset statistics, pp: number of features, NN: number of samples
Dataset boston diabetes mpg bodyfat
NN 506 768 234 252
pp 13 8 11 14

We train the random forest regression function on these data sets with 100 trees and use default settings for the other parameters in the scikit-learn package [25]. For comparison, we run the following optimization algorithms: DIRECT [16] (https://scipydirect.readthedocs.io/en/latest/), DE [28] (https://docs.scipy.org/doc/scipy/reference/index.html), PSO [17] (https://pypi.org/project/pyswarms/), RBFOpt [4] (https://projects.coin-or.org/RBFOpt), StepDIRECT-0, and StepDIRECT. For both StepDIRECT-0 and StepDIRECT in Algorithm 1, we set the maximum number of function evaluations m^{max}=2000. The same function-evaluation limit is used for DIRECT. In Algorithm 2, we select \delta=1, \delta_{min}=0.001, \delta_{max}=2.5, \tau=1.5, |{\cal{D}}|=5, and t^{max}=1.5p. For all other algorithms, we use the default settings. We run all algorithms 20 times and report the mean and standard deviation of the final objective function values (denoted by “obj”) and running times in seconds (denoted by “time”) in Table 2.

Table 2: Optimization of random forest regression functions for four data sets (mean±\pmstandard deviation). “obj” is the final objective function value, “time” is the running time in seconds. An entry is in bold if the mean value is the lowest in the column.
Algorithms boston diabetes mpg bodyfat
DIRECT obj 32.60 77.66 10.55 1.31
(±\pm 0.00) (±\pm 0.00) (±\pm 0.00) (±\pm 0.00)
time 8.21 9.14 1.52 6.15
(±\pm 0.01) (±\pm 0.06) (±\pm 0.04) (±\pm0.02)
StepDIRECT-0 obj 28.66 75.8 10.37 1.27
(±\pm 0.00) (±\pm 0.00) (±\pm 0.00) (±\pm 0.00)
time 7.59 9.41 1.66 6.03
(±\pm 0.01) (±\pm 0.16) (±\pm 0.09) (±\pm 0.03)
StepDIRECT obj 28.35 58.68 9.85 1.26
(±\pm0.02) (±\pm 1.04) (±\pm 0.03) (±\pm 0.01)
time 7.87 10.74 1.88 6.63
(±\pm 0.23) (±\pm 0.15) (±\pm 0.17) (±\pm 0.26)
DE obj 28.40 69.71 9.82 1.29
(±\pm 0.08) (±\pm 1.72) (±\pm 0.01) (±\pm 0.01)
time 15.22 23.86 8.85 3.34
(±\pm 0.79) (±\pm 0.45) (±\pm 0.70) (±\pm 0.29)
PSO obj 28.41 80.55 9.87 1.28
(±\pm 0.08) (±\pm 3.82) (±\pm 0.06) (±\pm 0.05)
time 30.60 34.67 34.47 29.49
(±\pm 0.49) (±\pm 11.91) (±\pm 0.88) (±\pm 0.14)
RBFOpt obj 28.56 87.34 10.40 1.33
(±\pm 0.02) (±\pm 3.82) (±\pm 0.00) (±\pm 0.04)
time 14.70 11.95 8.70 12.83
(±\pm 2.85) (±\pm 1.17) (±\pm 0.08) (±\pm 1.83)

From Table 2, we see that the function values returned by StepDIRECT-0 are better than those of the original DIRECT. This illustrates the benefit of our proposed strategies for identifying and dividing potentially optimal hyper-rectangles, which efficiently exploit the stepwise function structure. For DIRECT and StepDIRECT-0, the objective function outputs do not change across runs since they are deterministic algorithms.

To further assess the advantage of incorporating local search into StepDIRECT-0, in StepDIRECT the local search is initialized with the best point in the potentially optimal hyper-rectangle and uses search directions {\cal{D}} randomly generated by the coordinate strategy. Compared with StepDIRECT-0, the StepDIRECT algorithm achieves lower objective function values. From Table 2, StepDIRECT shows the best overall performance in terms of solution quality except on the mpg dataset. By embedding the local search, we can significantly improve solution quality. In general, the proposed algorithm also runs faster than the baseline methods DE, PSO, and RBFOpt, except on the bodyfat dataset, where DE runs faster than StepDIRECT.

4.2 Hyper-parameter Tuning for Classification

We tune the hyper-parameters for: 1) multi-class classification with linear support vector machines (SVM) and logistic regression, and 2) imbalanced binary classification with RBF-kernel SVM [11]. We use three datasets: MNIST from [18], PenDigits from the UCI Machine Learning Repository [6], and a synthetic dataset Synth generated by the Mldatagen generator [29]. For all datasets, we set the ratio among training, validation and test data partitions as 3:1:13:1:1 and report the performance on the test dataset.

4.2.1 Multi-class Classification

For multi-class classification problems with the number of classes K3K\geq 3, there are two widely-used approaches: “one-vs-all” and “one-vs-one” strategies.

One-vs-all classification: In this experiment, we tune the hyper-parameters for multi-class support vector machines and logistic regression. For multi-class classification problems, we take the “one-vs-all” approach to turning a binary classifier into a multi-class classifier. For each class k{1,,K}k\in\{1,\ldots,K\} and a given hyper-parameter CkC_{k}, the one-vs-all SVM solves the following problem for the dataset {𝐱i,yi}i=1N\{{\bf{x}}_{i},y_{i}\}_{i=1}^{N}

(4.13) min{𝐰k,bk}\displaystyle\min_{\{{\bf{w}}^{k},b^{k}\}} 12𝐰k2+Cki=1Nξik\displaystyle\dfrac{1}{2}\left\|{\bf{w}}^{k}\right\|^{2}+C_{k}\sum_{i=1}^{N}\xi_{i}^{k}
s.t. (𝐰k)T𝐱i+bk1ξik,\displaystyle({\bf{w}}^{k})^{T}{\bf{x}}_{i}+b^{k}\geq 1-\xi_{i}^{k}, if yi=k\displaystyle\quad\text{if }y_{i}=k
(𝐰k)T𝐱i+bk1+ξik,\displaystyle({\bf{w}}^{k})^{T}{\bf{x}}_{i}+b^{k}\leq-1+\xi_{i}^{k}, if yik\displaystyle\quad\text{if }y_{i}\neq k
ξik0,i{1,,N}.\displaystyle\xi_{i}^{k}\geq 0,\,i\in\{1,\ldots,N\}.

The class of each point 𝐱i{\bf{x}}_{i} is determined by

(4.14) class of 𝐱i=argmaxk{1,,K}{(𝐰k)T𝐱i+bk}.\text{class of }{\bf{x}}_{i}=\mathop{\rm argmax}_{k\in\{1,\ldots,K\}}\{({\bf{w}}^{k})^{T}{\bf{x}}_{i}+b^{k}\}.

Different from many default implementations, which use the same value for all C_{k}, 1\leq k\leq K, we allow them to take different values because the margins between different classes may vary significantly. The one-vs-all logistic regression follows the same approach, replacing (4.13) with the binary logistic regression classifier.
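For concreteness, a minimal sketch of the resulting black-box objective is shown below: a vector of per-class values log10(C_k) is mapped to one-vs-all validation accuracy. The dataset (scikit-learn digits) and the LinearSVC solver, whose default squared-hinge loss differs slightly from (4.13), are our own illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
classes = np.unique(y)

def ova_accuracy(log10_C):
    """One-vs-all linear SVM with a separate C_k per class; `log10_C` has one
    entry per class and the search box is [-3, 3]^K in log scale."""
    scores = np.empty((len(X_val), len(classes)))
    for k, cls in enumerate(classes):
        clf = LinearSVC(C=10.0 ** log10_C[k]).fit(X_tr, y_tr == cls)
        scores[:, k] = clf.decision_function(X_val)    # per-class score
    pred = classes[np.argmax(scores, axis=1)]          # rule (4.14)
    return float(np.mean(pred == y_val))

print(ova_accuracy(np.zeros(len(classes))))            # all C_k = 1
```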

We search the K hyper-parameters C_{k} for the K classifiers in the log space of \Omega=[10^{-3},10^{3}]^{K}. The number of hyper-parameters for each dataset is given in Table 3 and denoted by “p”. For comparison, we run the following algorithms: Random Search (RS), DIRECT, and Bayesian Optimization (BO) (https://rise.cs.berkeley.edu/projects/tune/). The widely used Grid Search (GS) is not considered here because GS is generally not applicable when the search-space dimension is beyond 5. For all algorithms, the computational budget is m^{max}=100K in terms of the number of base classifiers trained. The results for one-vs-all SVM and one-vs-all logistic regression are shown in Tables 3 and 4, respectively. We can observe that StepDIRECT gives the best test accuracy on these data sets. Compared with random search (RS), StepDIRECT improves the test accuracy by 0.6–2.0%. We notice that the original DIRECT algorithm makes little improvement and often gets stuck in a local region until consuming the entire running budget, while the StepDIRECT algorithm makes consistent progress by balancing local exploration and global search.

Table 3: One-vs-all SVM, KK: the number of classes, pp: the number of tuning parameters
Dataset KK pp RS DIRECT StepDIRECT BO
Synth 3 3 74.8% 75.3% 76.8% 75.0%
PenDigits 10 10 94.0% 94.2% 95.2% 94.9%
MNIST 10 10 91.3% 91.8% 92.2% 92.0%
Table 4: One-vs-all logistic regression, KK: the number of classes, pp: the number of tuning parameters
Dataset KK pp RS DIRECT StepDIRECT BO
Synth 3 3 74.5% 75.2% 75.5% 75.5%
PenDigits 10 10 94.8% 95.0% 95.3% 95.1%
MNIST 10 10 92.0% 92.2% 92.6% 92.3%

One-vs-one classification: For each pair of classes ii and j,1i<jKj,1\leq i<j\leq K, we need to tune an associated model parameter CijC_{ij}. The one-vs-one SVM solves the following problem

(4.15) min{𝐰ij,bij}\displaystyle\min_{\{{\bf{w}}^{ij},b^{ij}\}} 12𝐰ij2+Cijt=1Nξtij\displaystyle\dfrac{1}{2}\left\|{\bf{w}}^{ij}\right\|^{2}+C_{ij}\sum_{t=1}^{N}\xi_{t}^{ij}
s.t. (𝐰ij)T𝐱t+bij1ξtij,\displaystyle({\bf{w}}^{ij})^{T}{\bf{x}}_{t}+b^{ij}\geq 1-\xi_{t}^{ij}, if yt=i\displaystyle\quad\text{if }y_{t}=i
(𝐰ij)T𝐱t+bij1+ξtij,\displaystyle({\bf{w}}^{ij})^{T}{\bf{x}}_{t}+b^{ij}\leq-1+\xi_{t}^{ij}, if yt=j\displaystyle\quad\text{if }y_{t}=j
ξtij0,t{1,,N}\displaystyle\xi_{t}^{ij}\geq 0,\,t\in\{1,\ldots,N\}

for all pairs (i,j), 1\leq i<j\leq K. There are \frac{K(K-1)}{2} hyper-parameters C_{ij} to be tuned. We use the following voting strategy: if \text{sgn}(({\bf{w}}^{ij})^{T}{\bf{x}}_{t}+b^{ij}) says {\bf{x}}_{t} is in the i-th class, then the vote for the i-th class is increased by one; otherwise, the vote for the j-th class is increased by one. Then, {\bf{x}}_{t} is assigned to the class with the largest vote (a sketch of this voting rule is given after Tables 5 and 6). Similar to the one-vs-all case, we search different hyper-parameters for all pairwise classifiers (\Omega=[10^{-3},10^{3}]^{K(K-1)/2}). For all algorithms, the budget is 10K(K-1) in terms of the number of base classifiers trained. The results are shown in Tables 5 and 6 for one-vs-one SVM and one-vs-one logistic regression, respectively. Overall, StepDIRECT performs the best in many cases compared to the baseline algorithms; in particular, it outperforms the Bayesian optimization algorithm BO in 4 out of 6 cases.

Table 5: One-vs-one SVM, KK: the number of classes, pp: the number of tuning parameters
Dataset KK pp RS DIRECT StepDIRECT BO
Synth 3 3 78.0% 78.5% 79.5% 78.8%
PenDigits 10 45 97.9% 98.2% 98.8% 98.6%
MNIST 10 45 91.5% 92.2% 93.0% 93.2%
Table 6: One-vs-one logistic regression, KK: the number of classes, pp: the number of tuning parameters
Dataset KK pp RS DIRECT StepDIRECT BO
Synth 3 3 77.5% 77.75% 78.25% 78.5%
PenDigits 10 45 96.8% 97.8% 98.4% 97.4%
MNIST 10 45 93.3% 93.8% 94.1% 93.7%
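As referenced above, here is a minimal sketch of the one-vs-one voting rule; the pairwise classifier outputs are stubbed with fixed signs for illustration, not taken from trained models.

```python
import numpy as np
from itertools import combinations

def one_vs_one_predict(pairwise_sign, K, n):
    """Vote-based prediction: pairwise_sign(i, j) returns, for n points,
    +1 if the (i, j)-classifier assigns the point to class i, else -1."""
    votes = np.zeros((n, K), dtype=int)
    for i, j in combinations(range(K), 2):
        s = pairwise_sign(i, j)         # shape (n,), entries in {+1, -1}
        votes[s > 0, i] += 1
        votes[s < 0, j] += 1
    return np.argmax(votes, axis=1)     # class with the largest vote

# toy check with K = 3 classes and 2 points
outputs = {(0, 1): np.array([+1, -1]), (0, 2): np.array([+1, -1]),
           (1, 2): np.array([+1, -1])}
print(one_vs_one_predict(lambda i, j: outputs[(i, j)], K=3, n=2))  # [0 2]
```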

4.2.2 Imbalanced Binary Classification

In the second example, we consider tuning the parameters of an RBF-SVM, namely the \ell_{2}-regularization parameter C, the kernel width \gamma, and the class weight c_{p}, for imbalanced binary classification problems. In this setting, we compare the performance of StepDIRECT with DIRECT, RS, BO, and GS in tuning the three-dimensional hyper-parameter w=(C,\gamma,c_{p}) to achieve a high test accuracy.

In our experiments, we used five binary classification datasets, as shown in Table 7, which can be obtained from LIBSVM package [2].

Table 7: Data set statistics, pp: the number of features, NN: the number of samples, N/N+N_{-}/N_{+}: the class distribution ratio
Dataset pp NN N/N+N_{-}/N{+}
fourclass 2 862 1.8078
diabetes 8 768 1.8657
statlog 24 1,000 2.3333
svmguide3 22 1,284 3.1993
ijcnn1 22 141,691 9.2643
Table 8: RBF-SVM
Dataset RS GS DIRECT StepDIRECT BO
diabetes 79.8% 81.2% 79.2% 81.8% 81.3%
fourclass 100.0% 80.9% 100.0% 100.0% 100.0%
statlog 77.5% 78.5% 80.5% 81.6% 81.0%
svmguide3 83.5% 83.9% 84.3% 84.3% 82.9%
ijcnn1 97.6% 98.3% 94.6% 98.2% 98.3%

For all algorithms, the feasible set is chosen as C\in[10^{-3},10^{3}], \gamma\in[10^{-6},10^{0}] and c_{p}\in[10^{-2},10^{2}], and the search is conducted in log scale. For the GS algorithm, we uniformly select 5 candidates for each parameter. For a fair comparison, we set the limit on the number of function evaluations to 125 for all algorithms. Table 8 shows the test accuracy for this experiment.

As we can see from Table 8, StepDIRECT is always the top performer when compared with DIRECT and RS algorithms. Furthermore, the proposed algorithm achieves competitive performance with BO.
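For concreteness, a minimal sketch of the three-dimensional objective w = (C, \gamma, c_{p}), searched in log scale, is given below; the imbalanced synthetic data stand in for the LIBSVM datasets of Table 7 and are our own choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# imbalanced synthetic binary problem (stand-in for the LIBSVM datasets)
X, y = make_classification(n_samples=1500, n_features=20, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def rbf_svm_accuracy(z):
    """z = (log10 C, log10 gamma, log10 c_p); the feasible box of Sect. 4.2.2
    corresponds to [-3, 3] x [-6, 0] x [-2, 2] in log scale."""
    C, gamma, cp = 10.0 ** np.asarray(z, dtype=float)
    clf = SVC(C=C, gamma=gamma, class_weight={0: 1.0, 1: cp}).fit(X_tr, y_tr)
    return float(np.mean(clf.predict(X_te) == y_te))

print(rbf_svm_accuracy([0.0, -2.0, 0.0]))     # C = 1, gamma = 0.01, c_p = 1
```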

5 Conclusion

In this paper, we have proposed the StepDIRECT algorithm for minimizing black-box stepwise functions, which exploits the special structure of the problem. We have introduced a new definition to identify potentially optimal hyper-rectangles and a new rule to divide them. A stochastic local search is embedded to improve solution quality and speed up convergence. Global convergence is proved, and numerical results on two practical machine learning problems show the state-of-the-art performance of the algorithm. In the future, we plan to combine ideas from StepDIRECT and Bayesian optimization, and to develop practical algorithms for tuning discrete hyper-parameters in machine learning and deep neural networks.

References

  • [1] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • [2] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:1–27, 2011.
  • [3] A. R. Conn, K. Scheinberg, and Ph. L. Toint. Recent progress in unconstrained nonlinear optimization without derivatives. Mathematical Programming, 79:397–414, 1997.
  • [4] Alberto Costa and Giacomo Nannicini. RBFOpt: an open-source library for black-box optimization with costly function evaluations. Mathematical Programming Computation, 10(4):597–629, 2018.
  • [5] Emir Demirović, Peter Stuckey, James Bailey, Jeffrey Chan, Chris Leckie, Kotagiri Ramamohanarao, and Tias Guns. An investigation into prediction+optimisation for the knapsack problem. In CPAIOR, pages 241–257, 2019.
  • [6] Dheeru Dua and Casey Graff. UCI machine learning repository, http://archive.ics.uci.edu/ml, 2021.
  • [7] D. E. Finkel and C. T. Kelley. Additive scaling and the DIRECT algorithm. Journal of Global Optimization, 36(4):597–608, Dec 2006.
  • [8] Daniel E Finkel. DIRECT optimization algorithm user guide, 2003.
  • [9] Peter I Frazier. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
  • [10] Ratko Grbić, Emmanuel Karlo Nyarko, and Rudolf Scitovski. A modification of the DIRECT method for lipschitz global optimization for a symmetric function. Journal of Global Optimization, 57(4):1193–1212, 2013.
  • [11] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.
  • [12] Robert Hooke and Terry A Jeeves. Direct search solution of numerical and statistical problems. Journal of the ACM, 8(2):212–229, 1961.
  • [13] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: Methods, systems, challenges. Automated Machine Learning, 2019.
  • [14] Donald Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, page 345–383, 2001.
  • [15] Donald R. Jones and Joaquim R. R. A. Martins. The DIRECT algorithm: 25 years later. Journal of Global Optimization, 2020.
  • [16] Donald R Jones, Cary D Perttunen, and Bruce E Stuckman. Lipschitzian optimization without the lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, 1993.
  • [17] J. Kennedy and R. Eberhart. Particle swarm optimization. In Proceedings of ICNN’95 - International Conference on Neural Networks, volume 4, 1995.
  • [18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [19] Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y Ng. Efficient L1 regularized logistic regression. In AAAI, volume 6, pages 401–408, 2006.
  • [20] G. Liuzzi, S. Lucidi, and V. Piccialli. Exploiting derivative-free local searches in DIRECT-type algorithms for global optimization. Computational Optimization and Applications, 65(2):449–475, Nov 2016.
  • [21] G. Liuzzi, Stefano Lucidi, and Veronica Piccialli. A direct-based approach exploiting local minimizations for the solution of large-scale global optimization problems. Computational Optimization and Applications, 45:353–375, 03 2010.
  • [22] Giampaolo Liuzzi, Stefano Lucidi, and Veronica Piccialli. A partition-based global optimization algorithm. Journal of Global Optimization, 48:113–128, 2010.
  • [23] Jonas Mockus. Bayesian approach to global optimization: theory and applications, volume 37. Springer Science & Business Media, 2012.
  • [24] John A. Nelder and Roger Mead. A simplex method for function minimization. Comput. J., 7:308–313, 1965.
  • [25] Fabian Pedregosa, Gaël Varoquaux, et al. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
  • [26] J D Pinter. Global optimization in action—continuous and lipschitz optimization: Algorithms, implementations and applications. Journal of the Operational Research Society, 48(4):449–449, 1997.
  • [27] M. J. D. Powell. A direct search optimization method that models the objective and constraint functions by linear interpolation. In S. Gomez and J.P. Hennart, editors, Advances in Optimization and Numerical Analysis, pages 51–67. Springer Netherlands, 1994.
  • [28] Kenneth Price, Rainer M Storn, and Jouni A Lampinen. Differential evolution: a practical approach to global optimization. Springer, 2006.
  • [29] Jimena Tomás, Newton Spolaôr, Everton Cherman, and Maria Monard. A framework to generate synthetic multi-label datasets. Electronic Notes in Theoretical Computer Science, 302:155 – 176, 2014.
  • [30] Virginia Torczon. On the convergence of pattern search algorithms. SIAM Journal on optimization, 7(1):1–25, 1997.

A Supplementary Material

Lemma 3.1. Let \epsilon>0 be the positive constant used in Definition 2 and f_{min} be the current best function value. Let \mathcal{I} be the set of indices of all existing hyper-rectangles. For each j\in\mathcal{I}, let \mathcal{I}_{l}=\{i\in\mathcal{I}:d_{i}\sigma_{i}<d_{j}\sigma_{j}\}, \mathcal{I}_{b}=\{i\in\mathcal{I}:d_{i}\sigma_{i}>d_{j}\sigma_{j}\}, \mathcal{I}_{e}=\{i\in\mathcal{I}:d_{i}\sigma_{i}=d_{j}\sigma_{j}\}, and g_{i}=\frac{f_{{\cal{H}}_{i}}-f_{{\cal{H}}_{j}}}{d_{i}\sigma_{i}-d_{j}\sigma_{j}} for all i with d_{i}\sigma_{i}\neq d_{j}\sigma_{j}. The hyper-rectangle {\cal{H}}_{j} is potentially optimal if and only if the following three conditions hold:

  • (a)

    fjfif_{{\cal{H}}_{j}}\leq f_{{\cal{H}}_{i}} for every ie;i\in\mathcal{I}_{e};

  • (b)

    maxilgiminibgi;\max_{i\in\mathcal{I}_{l}}g_{i}\leq\min_{i\in\mathcal{I}_{b}}g_{i};

  • (c)

    If fmedian>fminf_{median}>f_{min} then

    ϵfminfj|fminfmedian|+djσjminibgi|fminfmedian|;\displaystyle\epsilon\leq\frac{f_{min}-f_{{\cal{H}}_{j}}}{|f_{min}-f_{median}|}+\frac{d_{j}\sigma_{j}\min_{i\in\mathcal{I}_{b}}g_{i}}{|f_{min}-f_{median}|};

    otherwise,

    fjdjσjminibgi+fmin.f_{{\cal{H}}_{j}}\leq d_{j}\sigma_{j}\min_{i\in\mathcal{I}_{b}}g_{i}+f_{min}.
  • Proof.

    First, assume that j{\cal{H}}_{j} is potentially optimal. For iei\in\mathcal{I}_{e}, the inequality fjfif_{{\cal{H}}_{j}}\leq f_{{\cal{H}}_{i}} follows directly from (3.7).

    For ili\in\mathcal{I}_{l}, from (3.7), we have

    Kfjfidjσjdiσi,K\geq\frac{f_{{\cal{H}}_{j}}-f_{{\cal{H}}_{i}}}{d_{j}\sigma_{j}-d_{i}\sigma_{i}},

    and for ibi\in\mathcal{I}_{b}, it implies that

    Kfifjdiσidjσj.K\leq\frac{f_{{\cal{H}}_{i}}-f_{{\cal{H}}_{j}}}{d_{i}\sigma_{i}-d_{j}\sigma_{j}}.

    Hence, (b) directly follows from above by taking the maximum over l\mathcal{I}_{l} and taking the minimum over b\mathcal{I}_{b}.

    By (3.8), when fminfmedianf_{min}\neq f_{median}, we have

    ϵfminfj|fminfmedian|+Kdjσj|fminfmedian|.\epsilon\leq\frac{f_{min}-f_{{\cal{H}}_{j}}}{|f_{min}-f_{median}|}+K\frac{d_{j}\sigma_{j}}{|f_{min}-f_{median}|}.

    Eq. (3.10) is a consequence of the above inequality by taking

    K=minibfifjdiσidjσj.K=\min_{i\in\mathcal{I}_{b}}\frac{f_{{\cal{H}}_{i}}-f_{{\cal{H}}_{j}}}{d_{i}\sigma_{i}-d_{j}\sigma_{j}}.

    Similar arguments hold for (3.11) when fmin=fmedianf_{min}=f_{median}.

    For the other direction of the proof, from conditions (a) and (b), we can derive (3.7). Using conditions (b) and (c) we can get (3.8).        

Theorem 3.1. Suppose that 𝐰=(1/p,,1/p)p{\bf{w}}=(1/p,\ldots,1/p)\in\mathbb{R}^{p} and ff is continuous in a neighborhood of a global optimum. Then, StepDIRECT converges to the globally optimal function value for the stepwise function ff over the bounded box defined in (1.1).

  • Proof.

    We will prove that, for any distance \delta>0, StepDIRECT samples at least one point within distance \delta of the optimal solution {\bf{x}}^{*}. In fact, we argue that the points sampled by the algorithm form a dense subset of the unit hyper-cube \Omega.

    We note that new hyper-rectangles are generated by splitting existing ones into thirds along some dimensions, so the side lengths of any hyper-rectangle are of the form 3^{-k} for some k\geq 0. Notice that, since we always divide the larger sides, all sides of length 3^{-k} must be divided before any side of length 3^{-(k+1)}. After carrying out r divisions, the hyper-rectangle has j=\mod(r,p) sides of length 3^{-(k+1)} and p-j sides of length 3^{-k} with k=(r-j)/p. Hence, the diameter (the largest distance between two vertices) of the hyper-rectangle is given by

    (A.1) d=[j3^{-2(k+1)}+(p-j)3^{-2k}]^{0.5},

    and the volume is V=3^{-(pk+j)}=3^{-r}. We can see that, as the number of divisions approaches \infty, both the diameter d and the volume V converge to 0.

    At the tt-th iteration of StepDIRECT, let ltl_{t} be the fewest number of partitions undergone by any hyper-rectangle. It follows that these hyper-rectangles would have the largest diameter. We will show by contradiction that

    limtlt=.\lim_{t\rightarrow\infty}l_{t}=\infty.

    If \lim_{t\rightarrow\infty}l_{t} were bounded above by some r<\infty, we could define the following nonempty set

    r={i:Vi=3r}.\mathcal{I}_{r}=\{i:V_{i}=3^{-r}\}.

    We choose a hyper-rectangle j{\cal{H}}_{j} where jJ:=argmaxi{diσi}j\in J:=\mathop{\rm argmax}_{i}\{d_{i}\sigma_{i}\} with the best function value fjf_{{\cal{H}}_{j}} over the hyper-rectangle j{\cal{H}}_{j}. By the update rule, this hyper-rectangle will be potentially optimal as the conditions in Definition 2 are fulfilled with K~>max{K1,K2}\tilde{K}>\max\{K_{1},K_{2}\}, where

    (A.2) K1=fjfmin+ϵ|fminfmedian|djσj\displaystyle K_{1}=\dfrac{f_{{\cal{H}}_{j}}-f_{min}+\epsilon|f_{min}-f_{median}|}{d_{j}\sigma_{j}}
    K2=max{fjfidjσjdiσi:djσjdiσi}.\displaystyle K_{2}=\max\{\dfrac{f_{{\cal{H}}_{j}}-f_{{\cal{H}}_{i}}}{d_{j}\sigma_{j}-d_{i}\sigma_{i}}:d_{j}\sigma_{j}\neq d_{i}\sigma_{i}\}.

    If j\in\mathcal{I}_{r}, then |\mathcal{I}_{r}| decreases by 1. Otherwise, a series of new hyper-rectangles is generated with volumes no greater than \frac{V_{j}}{3}. We can repeat the above process infinitely many times. Notice that the variability \sigma_{j} is bounded within [\epsilon_{\sigma},1]. After a finite number of iterations, at least one hyper-rectangle j\in\mathcal{I}_{r} will be selected, so \mathcal{I}_{r} would eventually become empty. This contradiction proves that \lim_{t\rightarrow\infty}l_{t}=\infty. As a result, the points generated by StepDIRECT are dense in the original rectangle \Omega, and global convergence follows directly from the continuity assumption.