StepDIRECT - A Derivative-Free Optimization Method for Stepwise Functions
Abstract
In this paper, we propose the StepDIRECT algorithm for derivative-free optimization (DFO), in which the black-box objective function has a stepwise landscape. Our framework is based on the well-known DIRECT algorithm. By incorporating a measure of local variability to cope with the flatness of the landscape, we provide a new criterion to select the potentially optimal hyper-rectangles. In addition, we introduce a stochastic local search algorithm that operates on the potentially optimal hyper-rectangles to improve solution quality and convergence speed. Global convergence of the StepDIRECT algorithm is established. Numerical experiments on optimization of random forest models and on hyper-parameter tuning are presented to demonstrate the efficacy of our algorithm. The proposed StepDIRECT algorithm shows competitive performance compared with other state-of-the-art baseline DFO methods, including the original DIRECT algorithm.
1 Introduction
We introduce an optimization algorithm for solving a class of structured black-box deterministic problems, which often arise in data mining and machine learning:

$$\min_{x \in \Omega}\ f(x), \tag{1.1}$$

where $\Omega = \{x \in \mathbb{R}^n : l \le x \le u\}$ is a bounded hyper-rectangle for some given bounds $l \le u$. We assume that $f$ is a stepwise function whose closed-form formula is unavailable or costly to get and store. A stepwise function is a piecewise constant function over a finite number of disjoint subsets. We will point out two important motivating applications in the next section that fit into this framework: hyper-parameter tuning (HPT) for classification and optimizing tree ensemble regression.
Derivative-free optimization (DFO) has a long history and can be traced back to the deterministic direct-search (DDS) method proposed in [12]. DFO algorithms can be classified into two categories: local and global search methods. Local algorithms focus on techniques that seek a local minimizer. Direct local algorithms find search directions by evaluating the function values directly, for example, the Nelder–Mead algorithm [24] and the generalized pattern-search method [30]. Model-based algorithms construct and optimize a local surrogate model of the objective function to determine the new sample point; for instance, a radial basis function is utilized in RBFOpt [4] as the surrogate model, while polynomial models are used in [27, 3]. Due to the local flatness and discontinuous structure of the stepwise function in (1.1), a local search algorithm can easily get stuck at a poor local minimizer. We note that (1.1) can have an infinite number of local minimizers in some applications, and the gradients are either zero or undefined. Global search algorithms aim at finding a global solution. Methods based on Lipschitzian partitioning techniques for underestimating the objective include the dividing rectangles algorithm (DIRECT) [16] and branch-and-bound search [26]. Stochastic search algorithms such as particle swarm optimization (PSO) [17] and differential evolution (DE) [28] have also been considered. Global model-based approaches optimize a surrogate model, usually constructed over the entire search space, e.g., response surface methods [14]. These methods often require a large number of samples in order to obtain a high-fidelity surrogate. Recently, Bayesian optimization using Gaussian processes [9, 23] has been widely applied to black-box optimization, especially hyper-parameter tuning for machine learning models.
Due to the excessive number of local minimizers and the computational cost of a function evaluation in many applications, such as the hyper-parameter tuning problem, we focus on a global optimization algorithm that can reduce the function value quickly within a moderate number of iterations. In this work, we propose a global spatial-partitioning algorithm for solving problem (1.1), based on the idea of the well-known DIRECT algorithm [16]. The key ingredient in our algorithm is the selection of hyper-rectangles to partition and to sample new points from. We do not attempt to compute approximate gradients or build a surrogate of the objective function as many existing methods do. The name DIRECT comes from the shortening of the phrase “DIviding RECTangles”, which describes the way the algorithm partitions the feasible domain into a number of hyper-rectangles in order to move towards the optimum. An appealing feature of DIRECT is that it is insensitive to discontinuities and does not rely on gradient estimates. These characteristics make it well suited to the stepwise problem (1.1).
There are two main steps in a DIRECT-type algorithm: 1) selecting a sub-region within $\Omega$, the so-called potentially optimal hyper-rectangle, from which a new point is sampled, and 2) splitting the potentially optimal hyper-rectangle. In the literature, a number of variants of DIRECT have been proposed to improve its performance, most of them devoted to the first step [7, 22, 15]. A limited number of papers work on the second step, for example [10]. Another line of research for speeding up DIRECT is to incorporate a local search strategy into the framework [20]. Most of these modifications target a general objective function; to the best of our knowledge, very few of them have been successful in exploiting the problem structure or prior knowledge of the objective function [10, 22]. For example, in [10], the authors present an efficient modification of DIRECT to optimize a symmetric function by including an ordered set in the hyper-rectangle dividing step. In [22], the authors assume that the optimal function value is known in the hyper-rectangle selection step. In this paper, we propose a new DIRECT-type algorithm that exploits the stepwise landscape of the objective function and makes contributions to both steps. Compared with the original DIRECT and its variants, the proposed StepDIRECT differs in the following aspects:
- We provide a new criterion for selecting potentially optimal hyper-rectangles using the local variability and the best function value.
- We propose a new stochastic local search algorithm, specially designed for stepwise functions, to explore the solution space more efficiently. As a result, StepDIRECT is a stochastic sampling global optimization algorithm, where the incorporated stochasticity can help it escape from a local minimum.
- When prior information on the relative importance of the decision variables is available, we split the potentially optimal hyper-rectangles along the dimensions with higher relative variable importance.
We can prove that the algorithm converges asymptotically to the global optimum under mild conditions. Experimental results reveal that our algorithm performs better than or on par with several state-of-the-art methods in most of the settings studied.
2 Motivating Examples
To illustrate the importance of the proposed algorithm, we now describe two concrete examples. Before giving a detailed description of these examples, we present a formal definition of a stepwise function.
Definition 1
A function $f: \Omega \to \mathbb{R}$ is called a stepwise function over $\Omega$ if there exist a partition $\{\Omega_1, \dots, \Omega_m\}$ of $\Omega$ and real numbers $c_1, \dots, c_m$ such that $f(x) = \sum_{i=1}^{m} c_i\, \mathbb{1}_{\Omega_i}(x)$ for all $x \in \Omega$. Here, $\mathbb{1}_{\Omega_i}$ is the indicator function of the set $\Omega_i$.
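To make the definition concrete, the following minimal Python sketch builds a one-dimensional stepwise function from a hypothetical partition of $[0, 1]$ into four intervals with constant values; the thresholds and values are illustrative and not taken from the paper.

```python
import numpy as np

def stepwise_f(x, thresholds=(0.25, 0.5, 0.75), values=(3.0, 1.0, 2.5, 0.5)):
    """Piecewise-constant function on the partition
    [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1] of the unit interval."""
    idx = int(np.searchsorted(thresholds, x, side="right"))
    return values[idx]

# The function is flat almost everywhere, so gradients (where defined) are zero
# and carry no information about where the global minimizer lies.
print(stepwise_f(0.1), stepwise_f(0.3), stepwise_f(0.9))  # 3.0 1.0 0.5
```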
2.1 Tree-based Ensemble Regression Optimization
A common approach for the decision-making problem based on data-driven tools is to build a pipeline from historical data, to predictive model, to decisions. A two-stage solution is often used, where the prediction and the optimization are carried out in a separate manner [5]. Firstly, a machine learning model is trained to learn the underlying relationship between the controllable variables and the output. Secondly, the trained model is embedded in the downstream optimization to produce a decision. We assume that the regression model estimated from the data is a good representation of the complex relationship. In this paper, we consider the problem of optimizing a tree ensemble model such as random forests and boosting trees, where the predictive model has been learned from historical data.
The tree ensemble model combines predictions from multiple decision trees. A decision tree uses a tree-like structure to predict the outcome for an input feature vector $x \in \mathbb{R}^n$. The $j$-th regression tree in the ensemble model has the following form

$$t_j(x) = \sum_{i=1}^{p_j} c_{ji}\, \mathbb{1}_{R_{ji}}(x), \tag{2.2}$$

where $\{R_{j1}, \dots, R_{jp_j}\}$ represents a partition of the feature space and $c_{ji}$ are the leaf values. We can see that $t_j$ is a stepwise function. The tree ensemble regression function outputs predictions by taking the weighted sum of multiple decision trees as

$$f(x) = \sum_{j=1}^{T} w_j\, t_j(x), \tag{2.3}$$

where $w_j$ is the weight for the decision tree $t_j$. We assume that the parameters $\{R_{ji}, c_{ji}, w_j\}$ have been learned from data, and we use StepDIRECT to optimize $f$. Since each $t_j$ is constant over each sub-region $R_{ji}$, the weighted sum $f$ is a stepwise function. Formally, we have the following theorem.
Theorem 2.1
Regression functions of decision-tree ensemble methods that use a constant prediction value at each leaf node, including random forests and gradient boosting, are stepwise functions.
We note that for this type of regression function, we may obtain additional information about the objective function, such as the variable importance of the input features of a random forest [1].
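As a minimal illustration of this use case, the sketch below trains a scikit-learn random forest on a standard dataset and wraps its prediction as the black-box objective $f$ in (1.1); the dataset, number of trees, and box bounds are illustrative choices rather than the exact experimental setup of Section 4.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Prediction stage: train a tree ensemble on historical data.
X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Optimization stage: the trained ensemble is the black-box objective.
# It is piecewise constant in x, so its gradient carries no information.
def objective(x):
    return float(model.predict(np.asarray(x).reshape(1, -1))[0])

# Bounded box for the decision variables, e.g. the observed feature ranges.
bounds = list(zip(X.min(axis=0), X.max(axis=0)))
print(len(bounds), objective(X.mean(axis=0)))
```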
2.2 Hyper-parameter Tuning for Classification
In machine learning models, we often need to choose a number of hyper-parameters to obtain high prediction accuracy for unseen samples [13]. For example, in training an $\ell_1$-regularized logistic regression classifier, we tune the sparsity parameter [19]. A common practice is to split the entire dataset into three subsets: training data, validation data, and test data [11]. For a given set of hyper-parameters, we train the model on the training data and then evaluate the model performance on the validation data. Some widely used performance metrics include accuracy, precision, recall, $F_1$-score, and AUC (Area Under the Curve). The goal is to tune the hyper-parameters of the model to maximize the performance on the validation data. The task can be formulated as a black-box optimization problem.
First, we start with a binary classifier. Suppose that we are given training samples $(x_i^{tr}, y_i^{tr})$, $i = 1, \dots, N_{tr}$, and validation samples $(x_i^{val}, y_i^{val})$, $i = 1, \dots, N_{val}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$. For a fixed set of hyper-parameters $\lambda$, a classification model $h_\lambda$ has been learned from the training data. When the performance measure is accuracy, the HPT problem is to determine $\lambda$ that maximizes the validation accuracy

$$f(\lambda) = \frac{1}{N_{val}} \sum_{i=1}^{N_{val}} \mathbb{1}\big[h_\lambda(x_i^{val}) = y_i^{val}\big],$$

where $(x_i^{val}, y_i^{val})$ is taken from the validation data. We can see that $f(\lambda)$ takes at most $N_{val} + 1$ distinct values for any $\lambda$; hence we expect the landscape of this function to exhibit stepwise behavior. As an example, Figure 1 plots the landscape for tuning the regularization parameter of logistic regression (shown on the horizontal axis) on the a1a dataset [2].

[Figure 1: Stepwise landscape of the validation accuracy when tuning the regularization parameter of logistic regression on the a1a dataset.]
The target function for other metrics can also take only a finite number of values. The observation still holds true for a multi-label classifier.
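The following sketch shows the kind of tuning objective described above, using synthetic data and scikit-learn's logistic regression; the data generator, split, and hyper-parameter range are illustrative stand-ins for the a1a setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data split (the a1a dataset from LIBSVM could be used instead).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def hpt_objective(log10_C):
    """Validation error as a function of the regularization hyper-parameter.
    Minimizing this black box is the HPT problem; it is piecewise constant
    because accuracy only takes N_val + 1 distinct values."""
    clf = LogisticRegression(C=10.0 ** log10_C, max_iter=1000).fit(X_tr, y_tr)
    return 1.0 - clf.score(X_val, y_val)

for c in (-3, -1, 1, 3):
    print(c, hpt_objective(c))
```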
3 StepDIRECT for Stepwise Functions
The StepDIRECT algorithm begins the optimization by transforming the domain of problem (1.1) linearly into the unit hyper-cube. Therefore, we assume for the rest of the paper that

$$\Omega = \{x \in \mathbb{R}^n : 0 \le x_i \le 1,\ i = 1, \dots, n\}. \tag{3.4}$$
In each iteration, StepDIRECT consists of three main steps. First, we identify a set of potentially optimal hyper-rectangles based on a new criterion; we expect these hyper-rectangles to have a high chance of containing a global optimal solution. Second, we perform a local search over the potentially optimal hyper-rectangles. Third, we divide the selected hyper-rectangles into smaller hyper-rectangles.
We improve over general DIRECT-type approaches by: 1) using a different heuristic to select which hyper-rectangle to split, which takes into account the local function variability; and 2) using a local search (randomized directional search) to choose a high-quality point as a representative for each hyper-rectangle. Both proposed strategies make use of the stepwise structure of the objective function.
At the $k$-th iteration, let $\mathcal{H}_k = \{H_i : i \in I_k\}$ be the set of all generated hyper-rectangles in the partition of $\Omega$, where $H_i = \{x \in \mathbb{R}^n : l^i \le x \le u^i\}$ for some bounds $l^i \le u^i$. We let $f_i^{best}$ denote the best function value of $f$ over $H_i$ (obtained by evaluating $f$ at the sampling points in the sub-region $H_i$). $f_{min}$ and $x_{min}$ are the best function value and the best feasible solution found over $\Omega$, respectively. We use $T_{eval}$ to count the number of function evaluations, and $T_{max}$ is the maximal number of function evaluations. We present our main algorithm StepDIRECT in Algorithm 1.
3.1 Initialization Step
We follow the first step of the original DIRECT for initialization [8]. In this step, Algorithm 1 starts by finding $c^1$, the center of the unit hyper-cube $\Omega$, and divides the hyper-cube by evaluating the function values at the points $c^1 \pm \delta e_i$, $i = 1, \dots, n$, where $\delta = 1/3$ and $e_i$ is the $i$-th unit vector. The idea of DIRECT is to select a hyper-rectangle with a small function value in a large search space; hence let us define

$$w_i = \min\{f(c^1 + \delta e_i),\ f(c^1 - \delta e_i)\}, \quad i = 1, \dots, n,$$

and the dimension with the smallest $w_i$ is partitioned into thirds. By doing so, the points $c^1 \pm \delta e_i$ become the centers of the newly generated hyper-rectangles. We initialize the set of hyper-rectangles $\mathcal{H}_1$, the values $f_i^{best}$, $d_i$, and $V_i$ for every hyper-rectangle, and update $f_{min}$ and the corresponding $x_{min}$.
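A minimal sketch of this initialization step on the unit cube is given below, assuming the reconstruction above (sample the center and the $2n$ points $c^1 \pm e_i/3$, then split the dimension with the smallest $w_i$); the toy objective is only for demonstration.

```python
import numpy as np

def initialize(f, n, delta=1.0 / 3.0):
    """Sketch of the DIRECT-style initialization on [0, 1]^n: sample the center
    c and c +/- delta*e_i, then pick the dimension whose best sampled value
    w_i = min(f(c + delta e_i), f(c - delta e_i)) is smallest for the first split."""
    c = np.full(n, 0.5)
    samples = {tuple(c): f(c)}
    w = np.empty(n)
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        f_plus, f_minus = f(c + delta * e), f(c - delta * e)
        samples[tuple(c + delta * e)] = f_plus
        samples[tuple(c - delta * e)] = f_minus
        w[i] = min(f_plus, f_minus)
    split_dim = int(np.argmin(w))   # dimension partitioned into thirds first
    return samples, split_dim

samples, dim = initialize(lambda x: np.floor(3 * x[0]) + np.floor(2 * x[1]), n=2)
print(dim, len(samples))   # chosen split dimension and 2n + 1 sampled points
```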
3.2 Potentially Optimal Hyper-rectangles
In this subsection, we propose a new criterion for StepDIRECT to select the next potentially optimal hyper-rectangles, which should be divided in this iteration. In the original DIRECT algorithm, every hyper-rectangle $H_i$ is represented by a pair $(f(c_i), d_i)$, where $f(c_i)$ is the function value at the center $c_i$ of $H_i$ and $d_i$ is the distance from the center of the hyper-rectangle to its vertices. The criterion to select hyper-rectangles for further division, i.e., the potentially optimal hyper-rectangles, is based on a score computed from $(f(c_i), d_i)$. A pure local strategy would select the hyper-rectangle with the smallest value of $f(c_i)$, while a pure global search strategy would choose one of the hyper-rectangles with the biggest value of $d_i$. The main idea of the DIRECT algorithm is to balance the local and global search, which can be achieved by using a score weighting the two strategies: $f(c_i) - K d_i$ for some $K > 0$. The original definition of potentially optimal hyper-rectangles for DIRECT [16] is based on two conditions:

$$f(c_i) - K d_i \le f(c_j) - K d_j \quad \text{for all } j \in I_k, \tag{3.5}$$
$$f(c_i) - K d_i \le f_{min} - \epsilon\, |f_{min}|, \tag{3.6}$$

for some $K > 0$.

An example of identifying these potentially optimal hyper-rectangles with the original DIRECT is given in Figure 2. Each dot in the two-dimensional $(d_i, f(c_i))$ space represents a hyper-rectangle; the three red dots, which attain the smallest value of $f(c_i) - K d_i$ for some $K > 0$ and provide a significant improvement (i.e., satisfy (3.6)), are considered potentially optimal.

[Figure 2: Illustration of potentially optimal hyper-rectangles (red dots) selected by the original DIRECT.]
The proposed StepDIRECT searches locally and globally by dividing all hyper-rectangles that meet the criteria in Definition 2. A hyper-rectangle $H_i$ is now represented by a triple $(f_i^{best}, d_i, V_i)$, in which we introduce the new notations $f_i^{best}$ and $V_i$. We use the higher-quality value $f_i^{best}$ returned by a local search as a representative for each hyper-rectangle, and a flatness measurement $V_i$ in the proposed criterion.
Definition 2
(potentially optimal hyper-rectangle) Let $\epsilon > 0$ be a small positive constant and $f_{min}$ be the current best function value over $\Omega$. Suppose $f_i^{best}$ is the best function value over the hyper-rectangle $H_i$. A hyper-rectangle $H_i$ is said to be potentially optimal if there exists $K > 0$ such that

$$f_i^{best} - K V_i d_i \le f_j^{best} - K V_j d_j \quad \text{for all } j \in I_k, \tag{3.7}$$
$$f_i^{best} - K V_i d_i \le f_{min} - \epsilon\, |f_{min} - f_{median}|, \tag{3.8}$$

where $V_i$ quantifies the local variability of $f$ on the hyper-rectangle $H_i$ (defined in (3.9)) and $f_{median}$ is the median of all function values evaluated so far.
In Eq. (3.7), we introduce the new notation $V_i$ to measure the local landscape of the stepwise function. Furthermore, we replace $f(c_i)$, as used in the original DIRECT, by $f_i^{best}$, which is computed with the help of local search. The value of $f_i^{best}$ generated during Steps a and c of Algorithm 1 is a better estimate of the global minimum of $f$ over $H_i$ because $f_i^{best} \le f(c_i)$. In Eq. (3.8), in order to eliminate sensitivity to scaling, $|f_{min}|$ is replaced by the difference $|f_{min} - f_{median}|$, as suggested in [7].
Now we discuss how to compute $V_i$. At the $k$-th iteration of the StepDIRECT algorithm, for a hyper-rectangle $H_i$ with center $c_i$ and diameter $d_i$, we define its set of neighborhood hyper-rectangles as

$$N(H_i) = \{H_j : \|c_j - c_i\| \le \lambda\, d_i,\ j \in I_k,\ j \ne i\}$$

for some $\lambda > 0$. Then, $N(H_i)$ can be further divided into two disjoint subsets $N_s(H_i)$ and $N_d(H_i)$ such that

$$N_s(H_i) = \{H_j \in N(H_i) : f_j^{best} = f_i^{best}\}, \qquad N_d(H_i) = \{H_j \in N(H_i) : f_j^{best} \ne f_i^{best}\}.$$

The local variability estimator for hyper-rectangle $H_i$ is defined as

$$V_i = \max\Big\{\frac{|N_d(H_i)|}{|N(H_i)|},\ \delta\Big\}, \tag{3.9}$$

where $\delta$ is a small positive number to prevent $V_i = 0$ when $N_d(H_i) = \emptyset$. In our experiments, we use fixed default values for $\delta$ and $\lambda$. The meaning of $V_i$ can be interpreted as follows. For each hyper-rectangle $H_i$, a large value of $V_i$ indicates that there are likely to be more distinct function values in the neighbourhood of the center $c_i$, which calls for more local exploration. We therefore propose to include such hyper-rectangles in the set of potentially optimal hyper-rectangles.
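The sketch below computes the local variability estimator under the form of (3.9) reconstructed above (the fraction of neighboring hyper-rectangles whose best value differs, floored at $\delta$); the neighborhood radius factor and the floor value are illustrative defaults, not the paper's settings.

```python
import numpy as np

def local_variability(i, centers, best_vals, diam, lam=2.0, delta=1e-2):
    """Fraction of neighboring hyper-rectangles (within lam * diam[i] of the
    center) whose best function value differs from rectangle i's, floored at
    delta.  lam and delta here are illustrative values."""
    dist = np.linalg.norm(centers - centers[i], axis=1)
    neighbors = np.where((dist <= lam * diam[i]) & (dist > 0))[0]
    if neighbors.size == 0:
        return delta
    n_diff = np.count_nonzero(best_vals[neighbors] != best_vals[i])
    return max(n_diff / neighbors.size, delta)

# Toy usage: 4 rectangles in 2-D with their centers, best values and diameters.
centers = np.array([[0.2, 0.2], [0.25, 0.2], [0.8, 0.8], [0.2, 0.25]])
best_vals = np.array([1.0, 1.0, 3.0, 2.0])
diam = np.array([0.1, 0.1, 0.1, 0.1])
print(local_variability(0, centers, best_vals, diam))  # 0.5: one of two neighbors differs
```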
By Definition 2, we can efficiently find potentially optimal hyper-rectangles based on the following lemma.
Lemma 3.1
Let $\epsilon > 0$ be the positive constant used in Definition 2 and $f_{min}$ be the current best function value. Let $I_k$ be the set of indices of all existing hyper-rectangles. For each $i \in I_k$, let $I_1 = \{j \in I_k : V_j d_j < V_i d_i\}$, $I_2 = \{j \in I_k : V_j d_j > V_i d_i\}$, and $I_3 = \{j \in I_k : V_j d_j = V_i d_i\}$. The hyper-rectangle $H_i$ is potentially optimal if and only if the three following conditions hold:

- (a) $f_i^{best} \le f_j^{best}$ for every $j \in I_3$;
- (b) $\displaystyle \max_{j \in I_1} \frac{f_i^{best} - f_j^{best}}{V_i d_i - V_j d_j} \le \min_{j \in I_2} \frac{f_j^{best} - f_i^{best}}{V_j d_j - V_i d_i}$;
- (c) if $f_{min} \ne f_{median}$, then

  $$\epsilon \le \frac{f_{min} - f_i^{best}}{|f_{min} - f_{median}|} + \frac{V_i d_i}{|f_{min} - f_{median}|} \min_{j \in I_2} \frac{f_j^{best} - f_i^{best}}{V_j d_j - V_i d_i}, \tag{3.10}$$

  otherwise,

  $$f_i^{best} \le V_i d_i \min_{j \in I_2} \frac{f_j^{best} - f_i^{best}}{V_j d_j - V_i d_i}. \tag{3.11}$$
We note that all proofs are deferred to the supplementary material. Definition 2 differs from the potentially optimal hyper-rectangle definition proposed in [16] (i.e., (3.5)-(3.6)) in the following three aspects:

- (3.7) weights the size term $d_i$ by the local variability $V_i$, which accounts for the stepwise landscape of $f$.
- (3.8) uses $|f_{min} - f_{median}|$ to remove sensitivity to linear and additive scaling and to improve the clustering of sample points near optimal solutions.
- $f_i^{best}$ can be different from the center value, i.e., $f_i^{best} \le f(c_i)$.
3.3 Dividing Potentially Optimal Hyper-rectangles
Once a hyper-rectangle has been identified as potentially optimal, StepDIRECT divides it into smaller hyper-rectangles, taking into account the side length and the variable importance measure of each dimension.

In tree ensemble regression optimization, we can obtain the variable importance, which indicates the relative importance of the different features [1]. In general, the variable importance relates to a significant change in function value when we move along a direction, and also to the number of discontinuity points along that coordinate. As a result, we tend to make more splits along directions with higher variable importance. Formally, let $p = (p_1, \dots, p_n)$ be the normalized variable importance with $\sum_{i=1}^{n} p_i = 1$ and let the side lengths of the hyper-rectangle be $u - l$; then we define

$$s_i = p_i\, (u_i - l_i), \quad i = 1, \dots, n, \tag{3.12}$$

as the relative importance of the coordinates for the selected hyper-rectangle. We choose the coordinate with the highest value $s_i$ to divide the hyper-rectangle into thirds.

If no variable importance measure is provided, we take $p_i = 1/n$ for all $i$. The dividing procedure is then the same as in the original DIRECT: we split along the dimension with the largest side length.
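A small sketch of this dividing rule, assuming the product form of (3.12) reconstructed above, is given below; with no importance information it reduces to DIRECT's longest-side rule.

```python
import numpy as np

def choose_split_dimension(lower, upper, importance=None):
    """Score each coordinate by (normalized variable importance) * (side length)
    as in the reconstructed (3.12) and return the highest-scoring coordinate,
    which is then split into thirds."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    n = lower.size
    p = np.full(n, 1.0 / n) if importance is None \
        else np.asarray(importance, float) / np.sum(importance)
    scores = p * (upper - lower)
    return int(np.argmax(scores))

# Rectangle [0,1] x [0,1/3]: without importance the longest side wins;
# with importance favoring the second axis, that axis is split instead.
print(choose_split_dimension([0, 0], [1, 1/3]))                         # 0
print(choose_split_dimension([0, 0], [1, 1/3], importance=[0.1, 0.9]))  # 1
```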
3.4 Local Search for Stepwise Function
In this subsection, we introduce a randomized local search algorithm designed for minimizing a stepwise function over a bounded box $\{x \in \mathbb{R}^n : l \le x \le u\}$.
It is known that DIRECT-type algorithms have a good ability to locate promising regions of the feasible space, and that a good local search procedure can help to converge quickly to the global optimum [21, 20]. Since the function values in the neighborhood of a point are almost surely constant for a stepwise function, existing local searches based on a local approximation of the objective function may fail to move to a better feasible solution. Hence, we propose a novel local search method, shown in Algorithm 2. Differently from classical trust-region methods, Algorithm 2 increases the stepsize when no better solution is discovered in the current iteration. This change is motivated by the stepwise landscape of the objective function.
We provide two different options to generate the search directions:

- (a) Coordinate strategy: for each axis $i$, we take the two directions $e_i$ and $-e_i$, with a probability calculated from the variable importance.
- (b) Random sampling from the unit sphere in $\mathbb{R}^n$.
The first option is more suitable when the discontinuity boundaries are parallel to the axes, for example, for ensemble tree regression functions. The second strategy works for general stepwise functions whose level-set boundaries are not axis-parallel. We store the feasible points sampled during the local search, together with their function values, in order to update the best function value of each hyper-rectangle after a new partition.
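Since Algorithm 2 itself is not reproduced here, the following is only a rough sketch that follows the textual description: random search directions (option (b)), a bounded box, and a stepsize that grows when no improvement is found; the growth factor, initial stepsize, and evaluation budget are illustrative.

```python
import numpy as np

def local_search(f, x0, lower, upper, evals=30, t0=0.05, grow=2.0, seed=0):
    """Randomized local search sketch for stepwise functions: try a step of
    size t along a random unit direction and, unlike trust-region methods,
    ENLARGE the stepsize on failure so the search can jump across plateaus."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    x, fx, t = np.asarray(x0, float).copy(), f(x0), t0
    history = [(x.copy(), fx)]                      # sampled points, reused later
    for _ in range(evals):
        d = rng.normal(size=x.size)
        d /= np.linalg.norm(d)                      # option (b): random unit direction
        x_new = np.clip(x + t * d, lower, upper)    # stay inside the bounded box
        f_new = f(x_new)
        history.append((x_new.copy(), f_new))
        if f_new < fx:                              # success: move, keep stepsize
            x, fx = x_new, f_new
        else:                                       # failure: grow the stepsize
            t = min(grow * t, np.max(upper - lower))
    return x, fx, history

# Toy usage on a 2-D stepwise function over the unit box.
f = lambda z: np.floor(4 * z[0]) + np.floor(4 * z[1])
print(local_search(f, [0.9, 0.9], [0, 0], [1, 1])[1])
```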
3.5 Global Convergence of StepDIRECT
Now, we are able to provide the global convergence result for StepDIRECT. We assume that the variable importance is uniform, i.e., $p_i = 1/n$, for simplicity; the following theorem is still valid as long as $p_i > 0$ for every $i = 1, \dots, n$.
Theorem 3.1
Suppose that $f$ is continuous in a neighborhood of a global optimum. Then, StepDIRECT converges to the globally optimal function value of the stepwise function $f$ over the bounded box $\Omega$ defined in (1.1).
4 Numerical Experiments
In this section, we test the performance of the StepDIRECT algorithm on two problems: optimization of a random forest regression function and hyper-parameter tuning for classification. As explained in Section 2, in both cases we need to minimize a stepwise target function. We denote by StepDIRECT-0 the variant of StepDIRECT in which we skip Local_Search in Step a of Algorithm 1.
4.1 Optimization for Random Forest Regression Function
We consider the minimization of a random forest regression function over a bounded box constraint. We used the boston, diabetes, mpg and bodyfat data sets from the UCI Machine Learning Repository [6]. Table 1 provides the details of these four data sets.
Table 1: Data sets for random forest regression optimization.

| Dataset | boston | diabetes | mpg | bodyfat |
|---|---|---|---|---|
| # samples | 506 | 768 | 234 | 252 |
| # features | 13 | 8 | 11 | 14 |
We train the random forest regression function on these data sets with 100 trees and use the default settings for all other parameters in the scikit-learn package [25]. For comparison, we run the following optimization algorithms: DIRECT [16] (https://scipydirect.readthedocs.io/en/latest/), DE [28] (https://docs.scipy.org/doc/scipy/reference/index.html), PSO [17] (https://pypi.org/project/pyswarms/), RBFOpt [4] (https://projects.coin-or.org/RBFOpt), StepDIRECT-0, and StepDIRECT. For both StepDIRECT-0 and StepDIRECT in Algorithm 1, we set a fixed maximum number of function evaluations, and the same evaluation limit is used for DIRECT. In Algorithm 2, we use fixed default parameter values. For all other algorithms, we use the default settings. We run all algorithms 20 times and report the mean and standard deviation of the final objective function values (denoted by “obj”) and running times in seconds (denoted by “time”) in Table 2.
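For concreteness, the sketch below reuses the random-forest objective from Section 2.1 and runs one of the baselines, differential evolution from SciPy, on it; the budget settings here are illustrative and smaller than those used in the experiments.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Black-box objective: prediction of a trained random forest (cf. Section 2.1).
X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
objective = lambda x: float(model.predict(x.reshape(1, -1))[0])
bounds = list(zip(X.min(axis=0), X.max(axis=0)))

# One baseline from the comparison: differential evolution (DE) from SciPy.
# maxiter and popsize are illustrative, not the budget used in the paper.
result = differential_evolution(objective, bounds, maxiter=50, popsize=10, seed=0)
print(result.fun, result.nfev)
```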
Table 2: Mean (standard deviation) of final objective values (“obj”) and running times in seconds (“time”) over 20 runs.

| Algorithm | | boston | diabetes | mpg | bodyfat |
|---|---|---|---|---|---|
| DIRECT | obj | 32.60 (0.00) | 77.66 (0.00) | 10.55 (0.00) | 1.31 (0.00) |
| | time | 8.21 (0.01) | 9.14 (0.06) | 1.52 (0.04) | 6.15 (0.02) |
| StepDIRECT-0 | obj | 28.66 (0.00) | 75.8 (0.00) | 10.37 (0.00) | 1.27 (0.00) |
| | time | 7.59 (0.01) | 9.41 (0.16) | 1.66 (0.09) | 6.03 (0.03) |
| StepDIRECT | obj | 28.35 (0.02) | 58.68 (1.04) | 9.85 (0.03) | 1.26 (0.01) |
| | time | 7.87 (0.23) | 10.74 (0.15) | 1.88 (0.17) | 6.63 (0.26) |
| DE | obj | 28.40 (0.08) | 69.71 (1.72) | 9.82 (0.01) | 1.29 (0.01) |
| | time | 15.22 (0.79) | 23.86 (0.45) | 8.85 (0.70) | 3.34 (0.29) |
| PSO | obj | 28.41 (0.08) | 80.55 (3.82) | 9.87 (0.06) | 1.28 (0.05) |
| | time | 30.60 (0.49) | 34.67 (11.91) | 34.47 (0.88) | 29.49 (0.14) |
| RBFOpt | obj | 28.56 (0.02) | 87.34 (3.82) | 10.40 (0.00) | 1.33 (0.04) |
| | time | 14.70 (2.85) | 11.95 (1.17) | 8.70 (0.08) | 12.83 (1.83) |
From Table 2, we see that the function values returned by StepDIRECT-0 are better than those of the original DIRECT. This illustrates the benefit of our proposed strategies for identifying and dividing potentially optimal hyper-rectangles, which efficiently exploit the stepwise function structure. For DIRECT and StepDIRECT-0, the objective function outputs do not change across runs since both are deterministic algorithms.
To further see the advantage of incorporating local search, in StepDIRECT the local search is initialized with the best point in the potentially optimal hyper-rectangle and runs with search directions randomly generated by the coordinate strategy. Compared with StepDIRECT-0, the StepDIRECT algorithm achieves lower objective function values. From Table 2, StepDIRECT shows the best overall performance in terms of solution quality except on the mpg dataset. By embedding the local search, we can significantly improve the solution quality. In general, the proposed algorithm runs faster than the other baseline methods DE, PSO, and RBFOpt, except on the bodyfat dataset, where DE has a shorter running time than StepDIRECT.
4.2 Hyper-parameter Tuning for Classification
We tune the hyper-parameters for: 1) multi-class classification with linear support vector machines (SVM) and logistic regression, and 2) imbalanced binary classification with an RBF-kernel SVM [11]. We use three datasets: MNIST from [18], PenDigits from the UCI Machine Learning Repository [6], and a synthetic dataset Synth generated by the Mldatagen generator [29]. For all datasets, we split the data into training, validation, and test partitions with a fixed ratio and report the performance on the test dataset.
4.2.1 Multi-class Classification
For multi-class classification problems with $k > 2$ classes, there are two widely used approaches: the “one-vs-all” and “one-vs-one” strategies.
One-vs-all classification: In this experiment, we tune the hyper-parameters for multi-class support vector machines and logistic regression. For multi-class classification problems, we take the “one-vs-all” approach to turn a binary classifier into a multi-class classifier. For each class $j \in \{1, \dots, k\}$ and a given hyper-parameter $C_j$, the one-vs-all SVM solves the following problem for the dataset $\{(x_i, y_i)\}_{i=1}^{N}$:

$$\begin{aligned}
\min_{w_j,\, b_j,\, \xi} \quad & \frac{1}{2}\|w_j\|^2 + C_j \sum_{i=1}^{N} \xi_i \\
\text{s.t.} \quad & y_i^j \big(w_j^\top x_i + b_j\big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N,
\end{aligned} \tag{4.13}$$

where $y_i^j = 1$ if $y_i = j$ and $y_i^j = -1$ otherwise. The class of each point $x$ is determined by

$$\hat{y}(x) = \arg\max_{j = 1, \dots, k}\ w_j^\top x + b_j. \tag{4.14}$$

Different from many default implementations that take the same value $C$ for all classes, we allow the $C_j$ to take different values because the margin between different classes may vary significantly. The one-vs-all logistic regression follows the same approach by replacing (4.13) with a binary logistic regression classifier.
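The sketch below illustrates the resulting tuning objective: a one-vs-all SVM with a separate $C_j$ per class, scored by validation accuracy. It uses scikit-learn's LinearSVC and synthetic data as stand-ins; the function and data names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def one_vs_all_accuracy(log10_C, X_tr, y_tr, X_val, y_val):
    """Validation accuracy of a one-vs-all SVM with one regularization value
    C_j per class (one entry of log10_C per class); StepDIRECT would treat
    this black box as the tuning objective."""
    classes = np.unique(y_tr)
    scores = np.zeros((X_val.shape[0], classes.size))
    for j, cls in enumerate(classes):
        # Binary problem (4.13): class `cls` versus the rest, with its own C_j.
        y_bin = np.where(y_tr == cls, 1, -1)
        clf = LinearSVC(C=10.0 ** log10_C[j]).fit(X_tr, y_bin)
        scores[:, j] = clf.decision_function(X_val)
    y_pred = classes[np.argmax(scores, axis=1)]     # decision rule (4.14)
    return float(np.mean(y_pred == y_val))

# Toy usage with synthetic 3-class data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
print(one_vs_all_accuracy([0.0, -1.0, 1.0], X_tr, y_tr, X_val, y_val))
```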
We search the hyper-parameters for the classifiers in log space. The number of hyper-parameters for each dataset is given in Table 3 (column “# hyper-params”). For comparison, we run the following algorithms: Random Search (RS), DIRECT, and Bayesian Optimization (BO) (https://rise.cs.berkeley.edu/projects/tune/). The widely used Grid Search (GS) is not considered here because GS is generally not applicable when the search space dimension is beyond 5. For all algorithms, the computational budget is measured in the number of base classifiers trained. The results for one-vs-all SVM and one-vs-all logistic regression are shown in Tables 3 and 4, respectively. We can observe that StepDIRECT gives the best test accuracy on these data sets. Compared with random search (RS), StepDIRECT consistently improves the test accuracy. We notice that the original DIRECT algorithm makes little improvement and often gets stuck in a local region until consuming the entire running budget, while the StepDIRECT algorithm makes consistent progress by balancing local exploration and global search.
Table 3: Test accuracy for one-vs-all SVM.

| Dataset | # classes | # hyper-params | RS | DIRECT | StepDIRECT | BO |
|---|---|---|---|---|---|---|
| Synth | 3 | 3 | 74.8% | 75.3% | 76.8% | 75.0% |
| PenDigits | 10 | 10 | 94.0% | 94.2% | 95.2% | 94.9% |
| MNIST | 10 | 10 | 91.3% | 91.8% | 92.2% | 92.0% |
Table 4: Test accuracy for one-vs-all logistic regression.

| Dataset | # classes | # hyper-params | RS | DIRECT | StepDIRECT | BO |
|---|---|---|---|---|---|---|
| Synth | 3 | 3 | 74.5% | 75.2% | 75.5% | 75.5% |
| PenDigits | 10 | 10 | 94.8% | 95.0% | 95.3% | 95.1% |
| MNIST | 10 | 10 | 92.0% | 92.2% | 92.6% | 92.3% |
One-vs-one classification: For each pair of classes $s$ and $t$, we need to tune an associated model parameter $C_{st}$. The one-vs-one SVM solves the following problem

$$\begin{aligned}
\min_{w_{st},\, b_{st},\, \xi} \quad & \frac{1}{2}\|w_{st}\|^2 + C_{st} \sum_{i:\, y_i \in \{s, t\}} \xi_i \\
\text{s.t.} \quad & y_i^{st} \big(w_{st}^\top x_i + b_{st}\big) \ge 1 - \xi_i, \quad \xi_i \ge 0,
\end{aligned} \tag{4.15}$$

for all pairs $s < t$, where $y_i^{st} = 1$ if $y_i = s$ and $y_i^{st} = -1$ if $y_i = t$. There are $k(k-1)/2$ hyper-parameters to be tuned. We use the following voting strategy: if the classifier for the pair $(s, t)$ says $x$ is in the $s$-th class, then the vote for the $s$-th class is increased by one; otherwise, the vote for the $t$-th class is increased by one. Then, $x$ is assigned to the class with the largest vote. Similar to the one-vs-all case, we search different hyper-parameters for all pairwise classifiers in log space. For all algorithms, the budget is measured in the number of base classifiers trained. The results are shown in Tables 5 and 6 for one-vs-one SVM and one-vs-one logistic regression, respectively. Overall, StepDIRECT performs best in most cases compared to the baseline algorithms; in particular, it outperforms the Bayesian optimization algorithm BO in 4 out of 6 cases.
Table 5: Test accuracy for one-vs-one SVM.

| Dataset | # classes | # hyper-params | RS | DIRECT | StepDIRECT | BO |
|---|---|---|---|---|---|---|
| Synth | 3 | 3 | 78.0% | 78.5% | 79.5% | 78.8% |
| PenDigits | 10 | 45 | 97.9% | 98.2% | 98.8% | 98.6% |
| MNIST | 10 | 45 | 91.5% | 92.2% | 93.0% | 93.2% |
Table 6: Test accuracy for one-vs-one logistic regression.

| Dataset | # classes | # hyper-params | RS | DIRECT | StepDIRECT | BO |
|---|---|---|---|---|---|---|
| Synth | 3 | 3 | 77.5% | 77.75% | 78.25% | 78.5% |
| PenDigits | 10 | 45 | 96.8% | 97.8% | 98.4% | 97.4% |
| MNIST | 10 | 45 | 93.3% | 93.8% | 94.1% | 93.7% |
4.2.2 Imbalanced Binary Classification
In the second example, we consider tuning the parameters of an RBF-SVM, namely the regularization parameter $C$, the kernel width $\gamma$, and the class weight $w$, for imbalanced binary classification problems. In this setting, we compare the performance of StepDIRECT with DIRECT, RS, BO, and GS in tuning the three-dimensional hyper-parameter to achieve high test accuracy.
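A minimal sketch of the corresponding black-box objective, with the three hyper-parameters searched in log scale, is shown below; the synthetic imbalanced data and the specific weight parameterization are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced toy data; the LIBSVM datasets in Table 7 would be used instead.
X, y = make_classification(n_samples=1500, n_features=10, weights=[0.85, 0.15],
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def rbf_svm_objective(z):
    """Validation error of an RBF-SVM as a function of the three hyper-parameters
    (log10 C, log10 gamma, log10 class weight of the minority class).  This is
    the stepwise black box minimized in Section 4.2.2."""
    C, gamma, w = 10.0 ** np.asarray(z, float)
    clf = SVC(C=C, gamma=gamma, kernel="rbf", class_weight={0: 1.0, 1: w})
    clf.fit(X_tr, y_tr)
    return 1.0 - clf.score(X_val, y_val)

print(rbf_svm_objective([0.0, -1.0, 0.5]))
```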
In our experiments, we used five binary classification datasets, shown in Table 7, which can be obtained from the LIBSVM package [2].
Table 7: Binary classification datasets.

| Dataset | # features | # samples | class imbalance ratio |
|---|---|---|---|
| fourclass | 2 | 862 | 1.8078 |
| diabetes | 8 | 768 | 1.8657 |
| statlog | 24 | 1,000 | 2.3333 |
| svmguide3 | 22 | 1,284 | 3.1993 |
| ijcnn1 | 22 | 141,691 | 9.2643 |
Table 8: Test accuracy for tuning the RBF-SVM hyper-parameters.

| Dataset | RS | GS | DIRECT | StepDIRECT | BO |
|---|---|---|---|---|---|
| diabetes | 79.8% | 81.2% | 79.2% | 81.8% | 81.3% |
| fourclass | 100.0% | 80.9% | 100.0% | 100.0% | 100.0% |
| statlog | 77.5% | 78.5% | 80.5% | 81.6% | 81.0% |
| svmguide3 | 83.5% | 83.9% | 84.3% | 84.3% | 82.9% |
| ijcnn1 | 97.6% | 98.3% | 94.6% | 98.2% | 98.3% |
For all algorithms, the feasible set is a bounded box for $(C, \gamma, w)$, and the search is conducted in log scale. For the GS algorithm, we uniformly select 5 candidates for each parameter. For a fair comparison, we set the budget to 125 function evaluations (the size of the $5^3$ GS grid) for all algorithms. Table 8 shows the test accuracy for this experiment.
As we can see from Table 8, StepDIRECT is always the top performer when compared with DIRECT and RS algorithms. Furthermore, the proposed algorithm achieves competitive performance with BO.
5 Conclusion
In this paper, we have proposed the StepDIRECT algorithm for optimizing black-box stepwise functions, which exploits the special structure of the problem. We have introduced a new definition to identify the potentially optimal hyper-rectangles and a new rule to divide them. A stochastic local search is embedded to improve solution quality and speed up convergence. Global convergence is proved, and numerical results on two practical machine learning problems show the state-of-the-art performance of the algorithm. In the future, we plan to combine ideas from StepDIRECT and Bayesian optimization, and to develop practical algorithms for tuning discrete hyper-parameters in machine learning and deep neural networks.
References
- [1] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
- [2] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:1–27, 2011.
- [3] A. R. Conn, K. Scheinberg, and Ph. L. Toint. Recent progress in unconstrained nonlinear optimization without derivatives. Mathematical Programming, 79:397–414, 1997.
- [4] Alberto Costa and Giacomo Nannicini. RBFOpt: an open-source library for black-box optimization with costly function evaluations. Mathematical Programming Computation, 10(4):597–629, 2018.
- [5] Emir Demirović, Peter Stuckey, James Bailey, Jeffrey Chan, Chris Leckie, Kotagiri Ramamohanarao, and Tias Guns. An investigation into prediction+optimisation for the knapsack problem. In CPAIOR, pages 241–257, 2019.
- [6] Dheeru Dua and Casey Graff. UCI machine learning repository, http://archive.ics.uci.edu/ml, 2021.
- [7] D. E. Finkel and C. T. Kelley. Additive scaling and the DIRECT algorithm. Journal of Global Optimization, 36(4):597–608, Dec 2006.
- [8] Daniel E Finkel. DIRECT optimization algorithm user guide, 2003.
- [9] Peter I Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
- [10] Ratko Grbić, Emmanuel Karlo Nyarko, and Rudolf Scitovski. A modification of the DIRECT method for Lipschitz global optimization for a symmetric function. Journal of Global Optimization, 57(4):1193–1212, 2013.
- [11] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.
- [12] Robert Hooke and Terry A Jeeves. Direct search solution of numerical and statistical problems. Journal of the ACM, 8(2):212–229, 1961.
- [13] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: Methods, systems, challenges. Automated Machine Learning, 2019.
- [14] Donald Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, page 345–383, 2001.
- [15] Donald R. Jones and Joaquim R. R. A. Martins. The DIRECT algorithm: 25 years later. Journal of Global Optimization, 2020.
- [16] Donald R Jones, Cary D Perttunen, and Bruce E Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, 1993.
- [17] J. Kennedy and R. Eberhart. Particle swarm optimization. In Proceedings of ICNN’95 - International Conference on Neural Networks, volume 4, 1995.
- [18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [19] Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y Ng. Efficient L1 regularized logistic regression. In AAAI, volume 6, pages 401–408, 2006.
- [20] G. Liuzzi, S. Lucidi, and V. Piccialli. Exploiting derivative-free local searches in DIRECT-type algorithms for global optimization. Computational Optimization and Applications, 65(2):449–475, Nov 2016.
- [21] G. Liuzzi, Stefano Lucidi, and Veronica Piccialli. A DIRECT-based approach exploiting local minimizations for the solution of large-scale global optimization problems. Computational Optimization and Applications, 45:353–375, 2010.
- [22] Giampaolo Liuzzi, Stefano Lucidi, and Veronica Piccialli. A partition-based global optimization algorithm. Journal of Global Optimization, 48:113–128, 2010.
- [23] Jonas Mockus. Bayesian approach to global optimization: theory and applications, volume 37. Springer Science & Business Media, 2012.
- [24] John A. Nelder and Roger Mead. A simplex method for function minimization. Comput. J., 7:308–313, 1965.
- [25] Fabian Pedregosa, Gaël Varoquaux, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
- [26] J D Pinter. Global optimization in action—continuous and Lipschitz optimization: Algorithms, implementations and applications. Journal of the Operational Research Society, 48(4):449–449, 1997.
- [27] M. J. D. Powell. A direct search optimization method that models the objective and constraint functions by linear interpolation. In S. Gomez and J.P. Hennart, editors, Advances in Optimization and Numerical Analysis, pages 51–67. Springer Netherlands, 1994.
- [28] Kenneth Price, Rainer M Storn, and Jouni A Lampinen. Differential evolution: a practical approach to global optimization. Springer, 2006.
- [29] Jimena Tomás, Newton Spolaôr, Everton Cherman, and Maria Monard. A framework to generate synthetic multi-label datasets. Electronic Notes in Theoretical Computer Science, 302:155 – 176, 2014.
- [30] Virginia Torczon. On the convergence of pattern search algorithms. SIAM Journal on Optimization, 7(1):1–25, 1997.
A Supplementary Material
Lemma 3.1. Let $\epsilon > 0$ be the positive constant used in Definition 2 and $f_{min}$ be the current best function value. Let $I_k$ be the set of indices of all existing hyper-rectangles. For each $i \in I_k$, let $I_1 = \{j \in I_k : V_j d_j < V_i d_i\}$, $I_2 = \{j \in I_k : V_j d_j > V_i d_i\}$, and $I_3 = \{j \in I_k : V_j d_j = V_i d_i\}$. The hyper-rectangle $H_i$ is potentially optimal if and only if the three following conditions hold:

- (a) $f_i^{best} \le f_j^{best}$ for every $j \in I_3$;
- (b) $\displaystyle \max_{j \in I_1} \frac{f_i^{best} - f_j^{best}}{V_i d_i - V_j d_j} \le \min_{j \in I_2} \frac{f_j^{best} - f_i^{best}}{V_j d_j - V_i d_i}$;
- (c) if $f_{min} \ne f_{median}$, then

  $$\epsilon \le \frac{f_{min} - f_i^{best}}{|f_{min} - f_{median}|} + \frac{V_i d_i}{|f_{min} - f_{median}|} \min_{j \in I_2} \frac{f_j^{best} - f_i^{best}}{V_j d_j - V_i d_i},$$

  otherwise,

  $$f_i^{best} \le V_i d_i \min_{j \in I_2} \frac{f_j^{best} - f_i^{best}}{V_j d_j - V_i d_i}.$$
Theorem 3.1. Suppose that $f$ is continuous in a neighborhood of a global optimum. Then, StepDIRECT converges to the globally optimal function value of the stepwise function $f$ over the bounded box $\Omega$ defined in (1.1).
Proof. We will prove that for any distance $\rho > 0$ to the optimal solution $x^*$, StepDIRECT samples at least one point within a distance $\rho$ of $x^*$. We will argue that the points sampled by the algorithm form a dense subset of the unit hyper-cube $\Omega$.

We note that new hyper-rectangles are generated by splitting existing ones into thirds along some dimensions. The side lengths of a hyper-rectangle are therefore of the form $3^{-k}$ for some $k \in \mathbb{N}$. Notice that, as we always divide the larger sides, we have to divide all sides of length $3^{-k}$ before dividing any side of length $3^{-(k+1)}$. After carrying out $r = nk + j$ divisions with $0 \le j < n$, the hyper-rectangle has $j$ sides of length $3^{-(k+1)}$ and $n - j$ sides of length $3^{-k}$. Hence, the diameter (the largest distance from one vertex to another vertex) of the hyper-rectangle is given by

$$d = \Big( j\, 3^{-2(k+1)} + (n - j)\, 3^{-2k} \Big)^{1/2}, \tag{A.1}$$

and the volume is $3^{-(nk + j)} = 3^{-r}$. We can see that, as the number of divisions $r$ approaches $\infty$, both the diameter and the volume converge to 0.
At the $k$-th iteration of StepDIRECT, let $r_k$ be the fewest number of partitions undergone by any hyper-rectangle; these hyper-rectangles have the largest diameter. We will show by contradiction that $r_k \to \infty$ as $k \to \infty$.

If $r_k$ were bounded by an upper limit $\bar{r}$, we could define the nonempty set $\mathcal{B}$ of hyper-rectangles that have undergone only $\bar{r}$ partitions and are never divided again. We choose a hyper-rectangle $H_{i^*} \in \mathcal{B}$ with the best function value $f_{i^*}^{best}$ over the hyper-rectangles in $\mathcal{B}$. By the update rule, this hyper-rectangle will be potentially optimal, as the conditions in Definition 2 are fulfilled with a sufficiently large constant $K$ (see (A.2)). If $H_{i^*}$ is divided, it follows that $|\mathcal{B}|$ decreases by 1. Otherwise, a series of new hyper-rectangles will be generated with volumes no greater than that of $H_{i^*}$. We can repeat the above process infinitely many times. Notice that the variability $V_i$ is bounded within $[\delta, 1]$. After a finite number of iterations, at least one hyper-rectangle in $\mathcal{B}$ will be selected, so $\mathcal{B}$ would eventually diminish to an empty set. This contradiction proves that $r_k \to \infty$. As a result, the points generated by StepDIRECT are dense in the original rectangle $\Omega$, and the global convergence follows directly from the continuity assumption. $\square$