
Generative Evolutionary Strategy for Black-box Optimization

Changhwi Park
Samsung Electronics DS, Korea
[email protected]
Alternate email: [email protected]
Abstract

Numerous scientific and technological challenges arise in the context of optimization; in particular, black-box optimization within high-dimensional spaces presents significant difficulties. Recent investigations into neural network-based black-box optimization have shown promising results. However, the effectiveness of these methods in navigating high-dimensional search spaces remains limited. In this study, we propose a black-box optimization method that combines an evolutionary strategy (ES) with a generative surrogate neural network (GSN) model. The integrated model is designed to function in a complementary manner: ES addresses the instability inherent in the surrogate network training of GSN models, and GSN improves the mutation efficiency of ES. In our experiments, this approach outperforms both classical optimization techniques and standalone GSN models.

1 Introduction

Black-box optimization plays a vital role in both science and technology; however, it has long remained unresolved, particularly for high-dimensional problems. While low-dimensional problems, with fewer than roughly 100 dimensions, can be optimized relatively easily, high-dimensional optimization problems pose far greater challenges. In specific cases, such as convex functions, classical algorithms such as evolution strategies (ES) [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] and others [4, 5, 6] can efficiently tackle high-dimensional optimization problems. However, their efficiency tends to decline rapidly on general black-box problems characterized by high dimensionality and non-convexity.

Furthermore, in high-dimensional optimization problems, the number of function calls inevitably grows with the dimension size. Consequently, maintaining O(N) time complexity is crucial to prevent the optimization process from failing due to rapidly increasing computation time. In this context, algorithms such as Bayesian optimization [1, 2, 3], which exhibit non-linear complexity, are at a significant disadvantage. Conversely, neural networks offer a promising solution: the field of artificial intelligence has demonstrated their considerable benefits in managing data within high-dimensional spaces, such as images or language, while preserving linear time complexity.

Recently, GSN-based approaches, inspired by Generative Adversarial Networks (GANs) [30, 31, 32], have emerged to tackle the black-box optimization problem, offering a novel solution for high-dimensional, non-convex problems. However, in contrast to GANs, generative surrogate neural networks (GSNs) face a significant challenge with the learning stability of the surrogate model, and the performance of GSN-like algorithms remains limited to a few hundred dimensions.

Hence, addressing the training instability problem is crucial for enhancing the performance of GSN-based algorithms [21, 29]. Our research aligns with this perspective; in this work, we introduce a method called Generative Evolutionary Optimization (GEO). GEO arises from the cooperative interaction between two linear-complexity algorithms, ES and GSN: ES stabilizes the surrogate network training of GSN, while GSN enhances the mutation efficiency of ES, so the two function complementarily.

In this study, we designed an algorithm to accomplish five goals: optimizing non-convex, high-dimensional, multi-objective, and stochastic target functions, while preserving O(N) complexity.

In the following chapters, we explore GEO's design and the methodologies used to address its most significant challenge: training instability. The Related works chapter discusses two prior works, Local Generative Surrogate Optimization (L-GSO) [21] and the Evolutionary Generative Adversarial Network (EGAN) [22], which provide the foundation for GEO's core concepts; it also discusses the importance of the generator network. The Methods chapter combines ideas from the L-GSO and EGAN studies to clarify which aspects of GSN require improvement and how they can be addressed. The Results chapter presents the findings from the test function experiments. Because GEO is a combination of ES and GSN, we assume a close relationship with ES and therefore evaluate it on non-convex test functions commonly used in ES research, such as the ZDT [42], Styblinski-Tang [41], Rosenbrock [39, 40], Rastrigin [35, 36, 37], and Ackley [38] functions. The experimental results show that GEO outperforms traditional ES and GSN as dimensionality increases, enabling optimizations of approximately 10,000 dimensions.

As mentioned earlier, we excluded non-linear algorithms from the comparison because our aim is to maintain O(N) complexity. This makes a direct comparison with Bayesian optimization, another significant branch of black-box optimization, difficult; the issue is revisited in the Conclusion chapter.

2 Related works

Before delving into the structure of GEO, we will first introduce some related works, focusing on GSN and GAN algorithms. By examining earlier optimization approaches that utilized neural networks, we can gain valuable insights. As GEO is inspired by an L-GSO study (a type of GSN) and an EGAN study (a type of GAN), we will provide a brief overview of both algorithms, discussing their strengths and weaknesses for a better understanding.

2.1 L-GSO

Local Generative Surrogate Optimization (L-GSO) is a type of GSN that approaches the problem using a local surrogate network and a local gradient. Consider a situation where we need to optimize a target function F(x) over a search space x ∈ X. If we identify x_0 as the optimal point at some stage, L-GSO samples the local space around x_0 and calculates pairs [x', F(x')], where x' = x_0 + ε and |ε| ≪ 1. It is important that ε is sufficiently small so that we only sample within a space close to x_0. From this data, L-GSO trains a surrogate network C, which acts as a local surrogate model around x_0 and can provide a local gradient. After training the surrogate network, the generator network G is trained using C. Let p = C(G(z)), where x = G(z) and z is an input seed. G is trained with the loss ∓p, which either increases or decreases the prediction p depending on the optimization direction. Finally, the trained generator suggests a new search point x, and the iterative process continues.

In this way, G is trained to produce optimal (or near-optimal) points x, assuming that C simulates the local distribution accurately enough. Meanwhile, because the data points are concentrated within a localized area, the surrogate network benefits from a stable training data region, which enables accurate gradient prediction. Another advantage is that data information is retained in the surrogate network C, reducing the amount of new data that must be generated when predicting the local gradient at subsequent points.
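To make the mechanism concrete, the following is a minimal sketch of one L-GSO iteration under the definitions above. It is an illustration, not the reference implementation: the network sizes, sample counts, epoch counts, and the minimization convention are our assumptions.

```python
import torch
import torch.nn as nn

def lgso_step(F, x0, eps=0.1, z_dim=8, n_samples=64, n_epochs=100):
    """One illustrative L-GSO iteration (minimization): fit a local
    surrogate C around x0, train a generator G through C, propose x."""
    dim = x0.numel()
    # Sample [x', F(x')] pairs inside a small hypercube around x0.
    xs = x0 + eps * (2 * torch.rand(n_samples, dim) - 1)
    ys = torch.tensor([float(F(x)) for x in xs]).unsqueeze(1)

    C = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
    opt_c = torch.optim.Adam(C.parameters(), lr=1e-3)
    for _ in range(n_epochs):                 # train the local surrogate
        opt_c.zero_grad()
        nn.functional.mse_loss(C(xs), ys).backward()
        opt_c.step()

    G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, dim))
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    for _ in range(n_epochs):                 # loss = +p pushes C(G(z)) down
        opt_g.zero_grad()
        C(G(torch.randn(16, z_dim))).mean().backward()
        opt_g.step()

    with torch.no_grad():                     # suggest the next search point
        return G(torch.randn(1, z_dim)).squeeze(0)
```

Here F is the black-box callable and x0 the current best point; note that in the actual L-GSO, the surrogate C is retained across iterations rather than retrained from scratch.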

Although the approach of utilizing GSNs in L-GSO is quite innovative, it has some limitations. The primary constraint is that it is only applicable to single-objective problems. The optimal point x_0 mentioned earlier is a single point. However, if the optimal points (the Pareto front) consist of multiple points, the distances between these optimal points, and hence the data sampling space the surrogate network must learn, are no longer local. This undermines the fundamental premise of L-GSO. The difficulty is that the optimal set of an n-objective function forms a Pareto front of dimension n-1. For example, in a two-objective function, the Pareto front forms a line, typically containing infinitely many non-dominated points that are distant from one another. Consequently, the locality of L-GSO becomes unsuitable for n-objective problems that require non-dominated sorting [58, 59].

The second problem with this algorithm is that its performance may be significantly influenced by the interaction between the hyperparameters and the test function. Specifically, the combination of the sampling hypercube size ε and the characteristics of the test function can strongly affect optimization performance. For instance, if the test function is convex, estimating the local gradient with any ε is generally unproblematic; for non-convex functions, however, it can be challenging. Around a local optimum, if ε is smaller than the size of the localized well, the algorithm can hardly escape the local optimum. Conversely, if ε is too large, the local gradient cannot be estimated accurately. The interplay among the type of test function, the location of the local optimum, and ε is substantial, making it difficult to determine an appropriate value of ε in advance, which is a disadvantage for black-box problems.

In conclusion, L-GSO effectively handles high-dimensional spaces using surrogate networks and improves the stability of surrogate network training through locality. However, non-dominated sorting for multi-objective problems is not feasible under this locality assumption, and the coupling between the hyperparameter ε and the test function can be too strong.

2.2 EGAN

The Evolutionary Generative Adversarial Network (EGAN) integrates ES to improve GAN performance. Although this method does not focus on black-box optimization, its algorithmic structure is similar to that of GEO.

Usually, GANs are trained with one generator and one discriminator (critic) network that are updated alternately. EGAN considers a scenario with only one discriminator network but multiple generator networks. The main idea is to rank the generator networks using an evolutionary strategy and keep only the fittest. Using the prediction of the discriminator C as the fitness score, the generators G that increase C(G(z)) at each iteration survive. This process incorporates ES into the GAN, and it has been established that introducing ES reduces mode collapse, a common issue in GANs.
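As a rough illustration of this survival step, the snippet below (our paraphrase, not EGAN's code) scores each candidate generator by the discriminator's mean prediction on its samples and keeps the top k.

```python
import torch

def select_generators(generators, D, z, k):
    """Rank candidate generators G by fitness D(G(z)) and keep the best k."""
    with torch.no_grad():
        fitness = [D(G(z)).mean().item() for G in generators]
    order = sorted(range(len(generators)), key=lambda i: fitness[i],
                   reverse=True)
    return [generators[i] for i in order[:k]]
```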

Although EGAN is not directly related to optimization problems, we can draw valuable insights for improving GSN from EGAN's concepts. Because the generator and surrogate network structure in GSN is similar to the generator and discriminator structure in GAN, we can adopt EGAN's strategy in GSN. This forms the core basis of our research, GEO. The main difference between the two settings is that a GAN operates stably without evolution, whereas a GSN typically diverges and fails when it consists of only a single generator and a single surrogate network. In the Methods section, we discuss how the working mechanism of GSN changes with the introduction of ES.

2.3 GLOnet

Global Optimization of dielectric metasurfaces using a physics-driven neural network (GLOnet) [23, 24, 25, 26, 27] investigates the optimization of device structures within a specific electromagnetic system using neural network techniques.

In this study, x serves as the device design parameters, and the goal is to find the value of x that maximizes the objective value F(x) of the simulator F. The algorithm finds x = G(z), where z is the input seed, and the generator G is trained by backpropagating the gradient from the simulator. Technically, this is not black-box optimization, as it receives analytic gradient information directly from the simulator; therefore, a surrogate network is not required.
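Schematically, one GLOnet-style update looks like the hypothetical snippet below, where F_diff stands for a differentiable simulator; because F_diff is part of the autograd graph, its gradient flows straight into the generator without any surrogate.

```python
import torch

def glonet_style_update(G, F_diff, opt, z):
    """Maximize F_diff(G(z)) by backpropagating the simulator's own gradient."""
    opt.zero_grad()
    loss = -F_diff(G(z)).mean()   # negative sign: gradient ascent on F
    loss.backward()
    opt.step()
```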

An important implication of the GLOnet study is the necessity for a deep generator network. One might assume that since we have a gradient, we can optimize x through direct gradient descent. However, this study demonstrates that employing a generator network is more advantageous than updating x directly without a generator. In fact, this also holds true for GSN studies. The importance of deep generators is not emphasized in many GSN studies; however, it can be argued that it is omitted because they are inspired by GANs, and the significance of a deep generator is implicitly understood.

3 Methods

In the previous chapters, we examined L-GSO, EGAN, and GLOnet. Our goal is to adopt the surrogate network training strategy of L-GSO while enabling multi-objective optimization, which cannot be implemented in the local surrogate model, and also reducing the excessive dependency between the hyperparameters and the test functions. Meanwhile, we adopt the evolution strategy of EGAN; however, unlike GAN, GSN has an inherently unstable structure; hence, we need to consider how to address this instability. Additionally, EGAN assumes a single-objective fitness score; hence, we need to modify it to work with multi-objective functions.

3.1 Training instability

Figure 1: a) Structural differences between GAN and GSN. b) A schematic figure illustrating the training instability of GSN. The suboptimal point does not converge towards the near-optimal region, but rather diverges in the opposite direction. c) The vicious cycle between the generator and critic, which is the origin of divergence.

First, let us explain why surrogate training in GSNs is particularly unstable. The structure of GSNs is almost identical to that of GANs, so it might seem that, just as GANs are highly successful at fake data generation, GSNs should also be successful. In practice, however, when we run a GSN model with just one generator network and one surrogate network, without any special tricks, it diverges to infinity almost immediately after starting in most test cases. This is neither the desired result nor can it be considered optimization. This is why GSN studies incorporate some kind of special trick.

The crucial difference between GANs and GSNs lies in the presence or absence of true data. GANs rely on fixed true data; for instance, if the task is to generate images of human faces, a large dataset of real face photographs is required. The discriminator network possesses a significant amount of fixed true data from the start and learns additional fake data during training, which provides a stable training region. Conversely, GSNs, which are used for black-box optimization, have no true data; we must start from scratch, with no prior information. The explored first Pareto-front points and their surroundings might play the role of true data, but the amount of data near the first Pareto front is insufficient because the front is constantly changing and data is collected on the fly. To draw an analogy, a GAN represents a situation in which the target is anchored by true data, whereas a GSN represents a situation in which the target floats without an anchor.

This is why the critic network of a GAN (the discriminator) and the critic network of a GSN (the surrogate) have similar learning structures yet yield completely different results. The erroneous learning process of the GSN's critic network can be summarized as follows:

  1. The critic network begins training with insufficient data.

  2. The training data contains minimal information about the Pareto front, which prevents the network from learning anything meaningful.

  3. Because the critic network is trained improperly, the gradient it provides fails to guide x towards the Pareto front.

  4. The generator, using this incorrect gradient information, produces a point unrelated to the Pareto front when suggesting the next x.

  5. The critic network is then retrained using the poorly generated x.

This leads to a vicious cycle between the generator and the critic network, a situation that does not occur in GANs owing to the presence of true data.

One solution to this problem is to create a data region that plays the role of the true data region in GANs. In optimization problems, the most crucial information is concentrated at the first Pareto front. Hence, it is necessary to sample the area around the first Pareto front and use it as training data for the critic network. L-GSO addresses this issue by intensively sampling only the local region around x_0, which represents a zero-dimensional first Pareto front, and supplying that data to the critic network. However, as mentioned in the Related works section, the local sampling method cannot be applied to n-objective functions with n > 1. Consequently, we need an approach that retains the spirit of local sampling while accommodating multi-objective target functions.

Hence, we suggest using ES. The idea is to keep generators G that produce x values close to the first Pareto front and discard those that produce points farther away. By selecting G through non-dominated sorting, we ensure that only [x, G] pairs generating search points near the Pareto front survive each iteration. Even if the critic network initially learns an incorrect region, as iterations progress, the probability of generating a search point x near the Pareto front increases. Eventually, the critic network acquires a stable training data region near the Pareto front, breaking the vicious cycle and achieving our desired outcome: stable critic network training for multi-objective targets.

GEO was developed with the notion that ES complements GSN; however, this idea can also be reversed. As previously mentioned, from the ES perspective, the algorithm consists of mutation and (non-dominated) sorting based on fitness scores, with generator networks being the targets of mutation. In this case, mutation occurs through backpropagation from the critic network, which can be more efficient because the neural network has learned information about the Pareto front. Therefore, from the ES perspective, GSN serves as an auxiliary means to enhance the mutation efficiency of ES.

In conclusion, GEO functions such that GSN complements ES and ES complements GSN. By doing so, it integrates the strengths of both while mitigating their weaknesses, particularly the instability problem of GSN. Moreover, since both GSN and ES are O(N) algorithms, GEO maintains O(N) complexity.

However, it is important to emphasize that both GSN and ES are exploitation-oriented algorithms in terms of the exploitation-exploration trade-off. Bayesian optimization, another significant branch of black-box optimization, offers a powerful feature: it estimates uncertainty and incorporates it into the next search step. From that perspective, both ES and GSN lack uncertainty estimation, and even with supplementary exploration strategies, they may not reach the robust exploration performance of Bayesian optimization. Consequently, GEO, like GSN and ES, cannot guarantee global optimization even as the number of function calls N approaches infinity. Nevertheless, GSN, ES, and GEO are free from the non-linear complexity problem that Bayesian optimization encounters. Hence, Bayesian optimization and GEO have distinctly different optimization goals and conditions under which they are best applied.

3.2 Operation steps

Figure 2: a) The overall algorithm of GEO. b) ES contributes to GSN by ensuring a stable training region, while GSN aids ES in carrying out efficient mutations. This creates a virtuous cycle where both algorithms complement each other’s weaknesses.

The following are the operation steps of GEO for n-objective black-box optimization F = (f_1, f_2, ..., f_i, ..., f_n). G: generator network; C_i: i-th critic network; z: input seed of the generator; x: search point in the search space X.

Pretraining steps:

  1. Prepare a set of generators and n critic networks. Each critic network predicts one corresponding objective. None of the networks have been trained yet.

  2. Prepare the [x, F(x)] training set for initializing the critic networks. The Latin hypercube method [52] is generally recommended as the initial sampling method for the search points x; alternatively, the initial points can come from generator networks initialized with standard weight initialization techniques such as Xavier or He initialization (see the sketch after this list).

  3. Store the initial [x, F(x)] pairs in a buffer memory of bounded maximum length.

  4. Pretrain the critic networks using the data stored in the buffer memory. For example, C_i is trained with [x, f_i(x)] pairs.
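A minimal sketch of this pretraining data preparation is shown below; the function name and the bounded-list buffer are our assumptions, and scipy's qmc module supplies the Latin hypercube sampler.

```python
import numpy as np
from scipy.stats import qmc

def init_buffer(F, dim, n_init, low, high, max_len):
    """Latin-hypercube initial sampling of the search space, stored as
    [x, F(x)] pairs in a buffer with a maximum length."""
    sampler = qmc.LatinHypercube(d=dim)
    xs = qmc.scale(sampler.random(n=n_init), low, high)
    buffer = [(x, F(x)) for x in xs]
    return buffer[-max_len:]   # keep at most max_len entries
```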

Main iteration steps (a minimal code sketch follows the list):

  1. Randomly sample a few generators from the generator set (the evolution pool).

  2. Train the sampled generators through backpropagation with the critic networks C_i. The loss function is -C_i(G(z)) for a maximization problem. From the ES perspective, this training step serves as mutation. Because there are n critic networks, the single-objective mutation is repeated n times; thus, each sampled G yields n mutants.

  3. Generate new points x' = G'(z) from the mutated generators G'.

  4. Evaluate F(x') at the new points x' and store the pairs [x', F(x')] in the buffer memory. If the buffer's maximum length is exceeded, delete the oldest stored data.

  5. Train the critic networks using the data in the buffer memory.

  6. Save the [G', F(x')] pairs back to the generator set; F(x') serves as the multi-objective fitness score.

  7. The evolution pool has now grown with the newly stored mutated generators G'. Perform a non-dominated sort on the pool based on fitness scores, and retain only a predetermined maximum number of generators with high fitness (the top Pareto fronts).

  8. Repeat the iteration.
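The following is a minimal sketch of one main iteration for a maximization problem. The pool layout ((generator, fitness) tuples), the helper non_dominated_sort (such as the one sketched in Appendix A.2), and all sizes and learning rates are illustrative assumptions, not the exact configuration used in our experiments.

```python
import copy
import random
import numpy as np
import torch
import torch.nn as nn

def geo_iteration(pool, critics, buffer, F, z_dim=8, n_sample=4,
                  pool_max=64, buffer_max=10000, lr=1e-3):
    """One illustrative GEO main iteration (maximization). `pool` holds
    (generator, fitness) pairs, `critics` one surrogate per objective,
    `buffer` the [x, F(x)] pairs with a maximum length."""
    # Steps 1-2: sample generators and make one mutant per critic.
    parents = random.sample(pool, min(n_sample, len(pool)))
    mutants = []
    for G, _ in parents:
        for C in critics:
            Gm = copy.deepcopy(G)
            opt = torch.optim.Adam(Gm.parameters(), lr=lr)
            opt.zero_grad()
            (-C(Gm(torch.randn(16, z_dim))).mean()).backward()
            opt.step()                      # mutation = one backprop step
            mutants.append(Gm)
    # Steps 3-4: generate new points, evaluate, store in the buffer.
    for Gm in mutants:
        with torch.no_grad():
            x = Gm(torch.randn(1, z_dim)).squeeze(0)
        fx = F(x)                           # tuple of n objective values
        buffer.append((x, fx))
        pool.append((Gm, fx))
    del buffer[:-buffer_max]                # drop the oldest entries
    # Step 5: retrain each critic on its own objective.
    xs = torch.stack([x for x, _ in buffer])
    for i, C in enumerate(critics):
        ys = torch.tensor([fx[i] for _, fx in buffer]).unsqueeze(1)
        opt = torch.optim.Adam(C.parameters(), lr=lr)
        for _ in range(10):
            opt.zero_grad()
            nn.functional.mse_loss(C(xs), ys).backward()
            opt.step()
    # Steps 6-7: non-dominated sort the pool and keep the best generators.
    scores = -np.array([fit for _, fit in pool])   # negate: sort a maximization
    fronts = non_dominated_sort(scores)            # assumed helper (Appendix A.2)
    keep = [i for front in fronts for i in front][:pool_max]
    pool[:] = [pool[i] for i in keep]
```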

The critic network could have been designed so that a single network predicts multiple target objectives. However, we kept one critic network per objective to minimize correlation, under the assumption that independent objective targets are more common in black-box problems. By dividing the critic into n parts and applying backpropagation separately, the algorithm behaves as if the single-objective problem were run n times independently. Nevertheless, it performs multi-objective optimization because the results are combined and ranked through non-dominated sorting.

In addition, if the target function is stochastic, age evolution can be incorporated into the non-dominated sorting step: we store the time order of the generators and remove a few of the oldest ones before performing non-dominated sorting, as sketched below.
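As an illustration, assuming each pool entry also carries the step at which its generator was created, the age-evolution pruning could look like this:

```python
def prune_oldest(pool, n_oldest):
    """Age evolution: drop the n_oldest generators before non-dominated
    sorting. Each entry is assumed to be (birth_step, generator, fitness)."""
    pool.sort(key=lambda entry: entry[0])   # oldest entries first
    return pool[n_oldest:]
```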

The provided explanation outlines the basic algorithmic structure of GEO. As described, the operation of GEO is achieved when both the ES-direction cycle and the GSN-direction cycle work together simultaneously.

4 Results

Figure 3: a) Comparison of GEO with baseline algorithms such as BO (Bayesian Optimization), GA (Genetic Algorithm), CMA-ES, and GEO with a single-layer generator. b) Comparison with LSM, a modified version of L-GSO.

Previously, we mentioned that L-GSO exhibits a particularly strong correlation between its hyperparameter set and the test function. However, all black-box optimization problems exhibit significant correlations among algorithm type, hyperparameter set, and target function, so even slight changes can lead to entirely different results. Comparing the performance of optimization algorithms is challenging precisely for this reason. It is impossible to prove which hyperparameter set is optimal for a specific test function. Even if an optimal hyperparameter set for a particular test function were found through repeated experimentation, it would constitute overfitting to that test function and would no longer be black-box optimization. A straightforward comparison of final results between algorithms is therefore susceptible to cherry-picking. Instead, this study focuses on how the trend of optimization performance depends on the dimension, rather than simply comparing final values.

For single-objective functions, it is crucial to compare GEO with L-GSO. To ensure a fair comparison, we matched L-GSO's network configuration to GEO's, referring to the result as the Local Surrogate Model (LSM). Examining how the performance of LSM and GEO changes as the dimension increases, LSM is more efficient at smaller dimensions, but its performance declines significantly as the dimension grows. The same pattern appears in classical ES and in Bayesian optimization (based on a Gaussian process). Although GEO is less efficient at lower dimensions, its efficiency improves as the dimension grows, eventually outperforming the other methods.

In the Related works chapter, we discussed the importance of deep generators. To investigate this further, we ran GEO with a single-layer fully connected network as the generator (while keeping the critic a deep neural network). The experiment showed that the shallow-generator GEO suffered a significant decrease in optimization performance.

Figure 4: a) Optimization results after 100,000 function calls on a two-objective function with 8192 dimensions. To investigate the influence of the initial condition, we compared Latin hypercube (LHC) initialization with point initialization (I), which matches the initial state of the GEO neural network. The results show that the influence of the initial state is insignificant. b) Optimization results in a stochastic environment, with random noise added to the ZDT function, after 100,000 function calls in 8192 dimensions.

We also conducted high-dimensional experiments on multi-objective functions. As the dimension increases, the performance of the comparison group declines rapidly, whereas GEO's performance is relatively well maintained; by 8192 dimensions, there is a significant difference in the final results. The comparison group tends to get trapped in local optima, whereas GEO manages to escape them and makes considerable progress.

However, one of the experimental results reveals a limitation of GEO. The ZDT2 function has a concave Pareto front. In this case, although GEO succeeds in optimization, it fails to find the entire shape of the Pareto front and tends to collapse toward one side. We also experimented with stochastic functions by adding normal random noise to the ZDT test functions. The results for the stochastic multi-objective functions exhibit similar characteristics: GEO performs better than the baseline methods but shows the same tendency to collapse toward one side on ZDT2. For ZDT1 and ZDT3, although some segments of the Pareto front are found, the collapsing tendency is stronger than in the non-stochastic cases.

In summary, as intended, GEO successfully overcomes the difficulties of critic network training in GSN and demonstrates stable performance. Because it does not assume locality, it performs well on multi-objective functions and operates effectively in stochastic environments. In low dimensions, GEO's performance is lower than that of traditional algorithms, possibly because the neural networks we used are too large; as the number of dimensions increases, however, it outperforms the other algorithms.

Figure 5: Changes in the optimization results of each algorithm as the dimension increases, using the ZDT3 test function after 100,000 function calls.

5 Conclusion

We can observe that GEO successfully accomplishes our five primary objectives: optimizing non-convex, high-dimensional, multi-objective, and stochastic target functions while maintaining O(N) complexity.

In the Related works chapter, we examined L-GSO and EGAN and combined their insights to create the foundational concept behind GEO. GSN suffers from unstable critic network training, and methods that concentrate sampling around optimal values can address this. Although L-GSO employs such an approach, the locality it introduces limits the algorithm to single-objective functions and results in excessive sensitivity to hyperparameters. Hence, we introduced an ES-combination strategy to create a stable training data region near the Pareto front. This method applies to multi-objective problems and also resolves the hyperparameter sensitivity problem because it does not assume a separate local ε size. Moreover, GSN and ES work complementarily, enhancing each other's strengths and compensating for each other's weaknesses, leading to improved efficiency.

As indicated in the Results chapter, GEO appears to be more effective in high dimensions rather than low dimensions. For instance, ES is an algorithm that is advantageous in low dimensions when a large number of function calls is available, whereas Bayesian optimization is favorable in low dimensions with limited number of function calls. Conversely, GEO might have an advantage when many function calls are possible in high dimensions. This observation suggests that GEO has the potential to address optimization areas not covered by existing algorithms.

Because GEO is specialized for high-dimensional non-convex functions, it is worth considering its potential applications in other areas of machine learning. For example, some research trains reinforcement learning (RL) neural networks through black-box optimization [60]. As these techniques require high-dimensional black-box optimization, GEO, which specializes in high-dimensional optimization, could be a viable option.

As previously mentioned in the Results section, one difficulty in black-box optimization research is the variability of algorithm performance. Performance can vary greatly depending on the combination of algorithm type, hyperparameters, and test function. A certain algorithm and hyperparameter set might be highly effective on a specific test function yet yield poor results on a different one. This can lead to preparation biased toward specific test functions, making it easy to cherry-pick results. In the worst case, one could develop an algorithm specialized for a target test function and fine-tune it through repeated experiments to obtain good results; such results would not be genuine black-box optimization outcomes because they exploit prior knowledge gained through iterative experiments. It is therefore challenging to establish state-of-the-art (SOTA) status in black-box optimization research. Although GEO demonstrates outstanding performance in this study, further research is necessary to determine its performance across various environments and to identify its limitations.

Acknowledgments and Disclosure of Funding

We gratefully acknowledge Kyunghyun Cho, Changwook Jeong, Moon-Hyun Cha, Dae Young Park, and Yujin Hwang for valuable help and discussions. This work was supported by Samsung Electronics Co., Ltd. (IO200626-07474-01).

References

  • [1] Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25.
  • [2] Frazier, P. I. (2018). A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
  • [3] Lan, G., Tomczak, J. M., Roijers, D. M., & Eiben, A. E. (2022). Time efficiency in optimization with a bayesian-evolutionary algorithm. Swarm and Evolutionary Computation, 69, 100970.
  • [4] Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. The computer journal, 7(4), 308-313.
  • [5] Kennedy, J., & Eberhart, R. (1995, November). Particle swarm optimization. In Proceedings of ICNN’95-international conference on neural networks (Vol. 4, pp. 1942-1948). IEEE.
  • [6] Hooke, R., & Jeeves, T. A. (1961). “Direct search” solution of numerical and statistical problems. Journal of the ACM (JACM), 8(2), 212-229.
  • [7] Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. A. M. T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, 6(2), 182-197.
  • [8] Deb, K., & Sundar, J. (2006, July). Reference point based multi-objective optimization using evolutionary algorithms. In Proceedings of the 8th annual conference on Genetic and evolutionary computation (pp. 635-642).
  • [9] Hansen, N., & Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2), 159-195.
  • [10] Hansen, N. (2006). The CMA evolution strategy: a comparing review. Towards a new evolutionary computation, 75-102.
  • [11] Price, K., Storn, R. M., & Lampinen, J. A. (2006). Differential evolution: a practical approach to global optimization. Springer Science & Business Media.
  • [12] Runarsson, T. P., & Yao, X. (2000). Stochastic ranking for constrained evolutionary optimization. IEEE Transactions on evolutionary computation, 4(3), 284-294.
  • [13] Runarsson, T. P., & Yao, X. (2005). Search biases in constrained evolutionary optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(2), 233-243.
  • [14] Deb, K., & Jain, H. (2013). An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints. IEEE transactions on evolutionary computation, 18(4), 577-601.
  • [15] Blank, J., Deb, K., & Roy, P. C. (2019, March). Investigating the normalization procedure of NSGA-III. In International Conference on Evolutionary Multi-Criterion Optimization (pp. 229-240). Springer, Cham.
  • [16] Seada, H., & Deb, K. (2015). A unified evolutionary optimization procedure for single, multiple, and many objectives. IEEE Transactions on Evolutionary Computation, 20(3), 358-369.
  • [17] Vesikar, Y., Deb, K., & Blank, J. (2018, November). Reference point based NSGA-III for preferred solutions. In 2018 IEEE symposium series on computational intelligence (SSCI) (pp. 1587-1594). IEEE.
  • [18] Carvalho, R. D., Saldanha, R. R., Gomes, B. N., Lisboa, A. C., & Martins, A. X. (2012). A multi-objective evolutionary algorithm based on decomposition for optimal design of Yagi-Uda antennas. IEEE Transactions on Magnetics, 48(2), 803-806.
  • [19] Li, K., Chen, R., Fu, G., & Yao, X. (2018). Two-archive evolutionary algorithm for constrained multiobjective optimization. IEEE Transactions on Evolutionary Computation, 23(2), 303-315.
  • [20] Panichella, A. (2019, July). An adaptive evolutionary algorithm based on non-Euclidean geometry for many-objective optimization. In Proceedings of the Genetic and Evolutionary Computation Conference (pp. 595-603).
  • [21] Shirobokov, S., Belavin, V., Kagan, M., Ustyuzhanin, A., & Baydin, A. G. (2020). Black-box optimization with local generative surrogates. Advances in Neural Information Processing Systems, 33, 14650-14662.
  • [22] Wang, C., Xu, C., Yao, X., & Tao, D. (2019). Evolutionary generative adversarial networks. IEEE Transactions on Evolutionary Computation, 23(6), 921-934.
  • [23] Jiang, J., & Fan, J. A. (2019). Global optimization of dielectric metasurfaces using a physics-driven neural network. Nano letters, 19(8), 5366-5372.
  • [24] Yang, J., Sell, D., & Fan, J. A. (2018). Freeform metagratings based on complex light scattering dynamics for extreme, high efficiency beam steering. Annalen der Physik, 530(1), 1700302.
  • [25] Hughes, T. W., Minkov, M., Williamson, I. A., & Fan, S. (2018). Adjoint method and inverse design for nonlinear nanophotonic devices. ACS Photonics, 5(12), 4781-4787.
  • [26] Jensen, J. S., & Sigmund, O. (2011). Topology optimization for nano-photonics. Laser & Photonics Reviews, 5(2), 308-321.
  • [27] Molesky, S., Lin, Z., Piggott, A. Y., Jin, W., Vucković, J., & Rodriguez, A. W. (2018). Inverse design in nanophotonics. Nature Photonics, 12(11), 659-670.
  • [28] Faury, L., Calauzenes, C., Fercoq, O., & Krichen, S. (2019). Improving evolutionary strategies with generative neural networks. arXiv preprint arXiv:1901.11271.
  • [29] Trabucco, B., Kumar, A., Geng, X., & Levine, S. (2021, July). Conservative objective models for effective offline model-based optimization. In International Conference on Machine Learning (pp. 10358-10368). PMLR.
  • [30] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
  • [31] Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
  • [32] Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4401-4410).
  • [33] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • [34] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., … & Kavukcuoglu, K. (2016, June). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928-1937). PMLR.
  • [35] Rastrigin, L. A. (1974) Systems of extremal control. Mir, Moscow.
  • [36] Hoffmeister, F., & Bäck, T. (1990, October). Genetic algorithms and evolution strategies: Similarities and differences. In International Conference on Parallel Problem Solving from Nature (pp. 455-469). Springer, Berlin, Heidelberg.
  • [37] Mühlenbein, H., Schomisch, M., & Born, J. (1991). The parallel genetic algorithm as function optimizer. Parallel computing, 17(6-7), 619-632.
  • [38] Ackley, D. (2012). A connectionist machine for genetic hillclimbing (Vol. 28). Springer Science & Business Media.
  • [39] Rosenbrock, H. (1960). An automatic method for finding the greatest or least value of a function. The Computer Journal, 3(3), 175-184.
  • [40] Dixon, L. C. W., & Mills, D. J. (1994). Effect of rounding errors on the variable metric method. Journal of Optimization Theory and Applications, 80(1), 175-179.
  • [41] Styblinski, M. A., & Tang, T. S. (1990). Experiments in nonconvex optimization: stochastic approximation with function smoothing and simulated annealing. Neural Networks, 3(4), 467-483.
  • [42] Deb, K., Thiele, L., Laumanns, M., & Zitzler, E. (2002, May). Scalable multi-objective optimization test problems. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600) (Vol. 1, pp. 825-830). IEEE.
  • [43] Kumar, S. (2020). Balancing a CartPole System with Reinforcement Learning–A Tutorial. arXiv preprint arXiv:2006.04938.
  • [44] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
  • [45] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4), 541-551.
  • [46] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. nature, 323(6088), 533-536.
  • [47] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
  • [48] Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
  • [49] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
  • [50] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1), 1929-1958.
  • [51] Blank, J., & Deb, K. (2020). Pymoo: Multi-objective optimization in python. IEEE Access, 8, 89497-89509.
  • [52] Iman, R. L., Davenport, J. M., & Zeigler, D. K. (1980). Latin hypercube sampling (program user’s guide).[LHC, in FORTRAN] (No. SAND-79-1473). Sandia Labs., Albuquerque, NM (USA).
  • [53] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
  • [54] Ros, R., & Hansen, N. (2008, September). A simple modification in CMA-ES achieving linear time and space complexity. In International conference on parallel problem solving from nature (pp. 296-305). Springer, Berlin, Heidelberg.
  • [55] Akimoto, Y., Auger, A., & Hansen, N. (2014, July). Comparison-based natural gradient optimization in high dimension. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation (pp. 373-380).
  • [56] Loshchilov, I. (2017). LM-CMA: An alternative to L-BFGS for large-scale black box optimization. Evolutionary computation, 25(1), 143-171.
  • [57] Blank, J., & Deb, K. (2022). pysamoo: Surrogate-Assisted Multi-Objective Optimization in Python. arXiv preprint arXiv:2204.05855.
  • [58] Tian, Y., Wang, H., Zhang, X., & Jin, Y. (2017). Effectiveness and efficiency of non-dominated sorting for evolutionary multi-and many-objective optimization. Complex & Intelligent Systems, 3(4), 247-263.
  • [59] Long, Q., Wu, X., & Wu, C. (2021). Non-dominated sorting methods for multi-objective optimization: review and numerical comparison. Journal of Industrial & Management Optimization, 17(2), 1001.
  • [60] Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
  • [61] Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S. M., & Teh, Y. W. (2018). Neural processes. arXiv preprint arXiv:1807.01622.

Supplementary Materials: Generative Evolutionary Strategy for Black-box Optimization

This appendix elaborates on the algorithm design details and additional experimental results of the Generative Evolutionary Strategy for Black-box Optimization, not covered in the main text.

Appendix A Model implementation details

A.1 Neural Network

In our research, the selection and design of the neural network type plays a fundamental role because our model is grounded on a surrogate neural network. For this study, we used a modification of the Transformer [49] (multi-head self-attention) model. This structure is suitable because it can scale to experiments of different sizes without structural changes. Another advantage is that it avoids the spatial correlation problem among variables.

While Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are beneficial for dimension expansion, they come with their own challenges, notably spatial correlation problems. In contrast, the problem with attention-based neural networks is computational complexity, which escalates to O(d^2) for variable size d. This often results in Graphics Processing Unit (GPU) memory shortages. Therefore, we had to look for other options: either a network like a CNN with O(d) complexity, or a completely new solution.

To address this, we introduced a strategy we term the trunk-branch trick. It retains the attention mechanism but subtly modifies the structure to reduce complexity. First, we create a trunk network; from the trunk, we attach M branch networks. Each branch predicts a segment of size d/M of the total dimension d of the target variable. The generator and critic structures are mirror images of each other. The generator starts processing in the trunk and then spreads out to the branches to create the target variable x; the critic works the other way around, with its branches taking parts of x and merging them in the trunk.

Implementing this method brings the complexity down to O(d^2/M), offering a solution to GPU memory shortages. However, the trunk-branch structure affects the optimization performance of GEO; the related experiment is described in the Additional experiments chapter.

Figure 6: Generator and critic network structure.
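As an illustration of the trick on the critic side, the sketch below applies self-attention within each branch's d/M variables, so the attention cost is M * (d/M)^2 = d^2/M. Sharing one attention module across branches (via batching) and all layer sizes are simplifying assumptions; the actual networks also mirror this structure on the generator side.

```python
import torch
import torch.nn as nn

class TrunkBranchCritic(nn.Module):
    """Illustrative trunk-branch critic: M branches attend over their own
    d/M variables, and the trunk merges branch summaries into one score."""
    def __init__(self, d, M, embed=32, heads=4):
        super().__init__()
        assert d % M == 0
        self.M, self.seg = M, d // M
        self.embed = nn.Linear(1, embed)               # one token per variable
        self.attn = nn.MultiheadAttention(embed, heads, batch_first=True)
        self.trunk = nn.Sequential(nn.Linear(M * embed, 64), nn.ReLU(),
                                   nn.Linear(64, 1))

    def forward(self, x):                              # x: (batch, d)
        b = x.shape[0]
        tokens = self.embed(x.reshape(b * self.M, self.seg, 1))
        h, _ = self.attn(tokens, tokens, tokens)       # attention per branch
        h = h.mean(dim=1).reshape(b, -1)               # per-branch summaries
        return self.trunk(h)
```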

A.2 Non-dominated sorting

When dealing with multiple target functions, there is often a competitive interaction between each function's optimal points. Imagine a variable x that increases the value of one function, f_1(x), while simultaneously reducing the value of another function, f_2(x), and vice versa. In such situations, we use a method known as non-dominated sorting to identify the optimal points.

Let us look at minimization problems to understand this better. We label a specific point x_0 as non-dominated when we cannot find another point x that satisfies f_i(x) < f_i(x_0) for all target objective functions f_i. These non-dominated points form a group known as the first Pareto front. We can then arrange the Pareto fronts in order, creating a sequence: the 1st Pareto front, the 2nd Pareto front, and so on. This ordered arrangement is what we call non-dominated sorting.
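The sketch below is a standard fast non-dominated sort (as in NSGA-II [7]) for a minimization problem, using the usual dominance relation: a point dominates another if it is no worse in every objective and strictly better in at least one. It is illustrative; our implementation is not tied to this particular routine.

```python
import numpy as np

def non_dominated_sort(F):
    """O(m * n^2) non-dominated sort for minimization. F is an
    (n_points, n_objectives) array; returns a list of fronts, each a list
    of point indices (front 0 is the first Pareto front)."""
    n = len(F)
    dominated_by = [set() for _ in range(n)]   # points that i dominates
    counts = np.zeros(n, dtype=int)            # how many points dominate i
    for i in range(n):
        for j in range(n):
            if np.all(F[i] <= F[j]) and np.any(F[i] < F[j]):
                dominated_by[i].add(j)
                counts[j] += 1
    fronts, current = [], [i for i in range(n) if counts[i] == 0]
    while current:
        fronts.append(current)
        nxt = []
        for i in current:
            for j in dominated_by[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    nxt.append(j)
        current = nxt
    return fronts
```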

There are many different methods in non-dominated sorting, each with its own computational complexity. However, we did not place too much emphasis on this aspect, as our pool size was small enough that the time required for non-dominated sorting was considerably less than that required for neural network computations.

A.3 Training details

In the Methods section of the main text, we noted GEO's tendency toward exploitation in the exploit-explore trade-off. This trait emerges primarily from its design as a combination of ES and GSN, and it could limit the algorithm's exploration. Therefore, we added several strategies to boost its exploration abilities.

One approach was to vary the learning rate. We prepared a wide range of learning rates, from small to large, and applied them either randomly or all at once for each mutation event; a large learning rate can provide an escape path when the algorithm gets stuck in a local optimum (see the sketch below). We also considered random mutations, perturbing the parameters of the sampled generator network or of specific layers to create mutants. At the very least, these additional techniques do not compromise performance, although we could not definitively establish that they enhance it.
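A hypothetical version of the learning-rate strategy: each mutation event draws its learning rate from a wide, assumed set, so occasional large steps can jump out of a local optimum.

```python
import copy
import random
import torch

def mutate_with_random_lr(G, C, z, lrs=(1e-4, 1e-3, 1e-2, 1e-1)):
    """Mutate a generator with a randomly drawn learning rate."""
    Gm = copy.deepcopy(G)
    opt = torch.optim.SGD(Gm.parameters(), lr=random.choice(lrs))
    opt.zero_grad()
    (-C(Gm(z)).mean()).backward()   # maximization loss, as in the main loop
    opt.step()
    return Gm
```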

We also tested various hyperparameters that directly impact the experimental results. For example, the pool size is important because it plays a key role in shaping the Pareto front.

For certain functions, such as the ZDT functions, it is necessary to set boundaries. The boundary construction method can affect performance. We can find the results of these experiments in the Additional experiments section.

For the training of the generator (mutation), we chose to train in n separate directions using n independent critic networks. An alternative would be to sum the n fitness scores and increase the average score. However, that approach could bias the search toward points in the middle of the Pareto front, and it introduces new hyperparameters related to the normalization of each fitness score, which significantly influence the algorithm's behavior. Therefore, to avoid introducing new hyperparameters, we conduct the training n times independently.

Appendix B Experiment details

B.1 Non-convex test functions

In our experiment, we analyzed multiple dimensions, each requiring comparative investigation. To keep the analysis consistent, we normalized all test functions in line with their dimension sizes. Most of the test functions have a ground state of 0; the Styblinski-Tang function does not, so we recalibrated its values to set 0 as the ground state. Although all of these functions are non-convex, the Rosenbrock function has distinct properties, as its optimal solution resides within a flat valley region.
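As an example of a test function whose value scale is naturally comparable across dimensions, the standard Ackley function below averages over coordinates; its ground state is 0 at the origin. This illustrates the kind of normalization we mean; it is not the paper's exact preprocessing code.

```python
import numpy as np

def ackley(x):
    """Standard Ackley function: non-convex, minimum 0 at the origin.
    The mean terms keep its value scale comparable across dimensions."""
    x = np.asarray(x, dtype=float)
    a, b, c = 20.0, 0.2, 2.0 * np.pi
    return (-a * np.exp(-b * np.sqrt(np.mean(x ** 2)))
            - np.exp(np.mean(np.cos(c * x))) + a + np.e)
```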

B.2 Computational details

We aimed to run the algorithm with minimal hardware acceleration and therefore designed it to be compatible with a single GPU. When we faced memory shortages, we employed the trunk-branch trick discussed in the neural network section. Using an NVIDIA Tesla V100 32 GB GPU, it took about three days to execute 100,000 function calls and the associated training. The majority of this computational time was spent training the neural networks.

We could improve the efficiency of the neural networks by incorporating a CNN or optimizing the attention model. However, given the scope of our task, 100,000 function calls across 8192 dimensions, the current model is sufficient, so we did not devote substantial time to network optimization. Moreover, we must avoid excessively refining the algorithm for a specific test function; such over-tuning could lead to overfitting and limit the model's performance on different problems, especially real-world ones.

In our study, we made 100,000 function calls for each task. Although this might seem like a considerable number, it is not particularly large when viewed within the context of machine learning and neural networks. Therefore, we found a smaller model that could run on a single GPU to be the most suitable choice; note that if we increase the model size, we can potentially see a decrease in performance. However, if the number of function calls were to increase significantly, we might require a larger model. This could occur in scenarios where the target black-box is readily parallelizable.

B.3 Further investigations

As we previously noted in the main text, Bayesian optimization forms a vital branch of optimization research, particularly Gaussian process-based Bayesian optimization, a practical method employed in machine learning and hyperparameter optimization. Despite its utility, it comes with O(N^3) complexity, which is a significant hurdle. A proposed solution is the neural process [61]. This approach is receiving increasing interest because it offers Gaussian process-like uncertainty estimation while reducing time complexity. However, it still demands the accumulation of search-point data for uncertainty prediction. This requirement causes a continuous escalation in computation time, preventing it from meeting O(N) complexity when employed in optimization tasks.

In the ES domain, numerous modifications have been developed on top of well-known algorithms such as GA and CMA-ES [54, 55, 56]. In practice, though, optimization experiments typically deal with around 100 dimensions. When larger dimensions are handled, the experiments usually focus on simpler landscapes such as the spherical function and the Rosenbrock function; note that Rosenbrock has a long, flat region around its global minimum.

For experiments in the 100-1000 dimension range, GSN-based optimization studies have shown promising results [29]. However, the test functions used in these studies are often special types rather than the common test functions widely used in ES studies, making performance comparisons difficult. In some cases, a generative neural network (GNN) is used without a surrogate model; one such GNN-based study [28] reports optimization results around 10 dimensions, which is far below our target dimensionality, so we did not include this algorithm.

Research based on surrogate models is also ongoing. For instance, Pysamoo [57] offers packaged optimization algorithms based on surrogate models. Nonetheless, we found that these offered models cannot be effectively applied in high dimensions due to time complexity problems. However, as the package is regularly updated, it could overcome this problem in the future.

Finally, Particle Swarm Optimization (PSO) [5] is a different type of algorithm from ES, but it has some similar features. Here, a certain number of elements swarm towards the global optimum while preserving their group. Therefore, we can consider how to combine PSO and GSN. However, within the scope of this study, we were unable to devise a way to integrate PSO. While ES provides a simple means to link GSN’s backpropagation and non-dominated sorting, establishing a similar connection in PSO poses a challenge.

Appendix C Additional experiments

C.1 Pool size

Figure 7: Pool size experiments. Optimization results in 8192 dimensions after 100,000 function calls.

The pool size has a direct impact on performance, especially when it is too small. In our experiment, GEO with a small pool had trouble identifying the overall shape of the Pareto front. This problem may arise when the algorithm loses some segments of the Pareto front during non-dominated sorting, making them difficult to recover.

C.2 Trunk-branch structure

Figure 8: Trunk-branch trick experiments. Optimization results in 8192 dimensions after 100,000 function calls. For M branches, 8192 = (8192/M) × M.

In the neural network section, we mentioned that the trunk-branch trick is a temporary solution to the problem of GPU memory shortage. Hence, we also explored the performance impact of the trunk-branch strategy.

The experimental result clearly shows that performance decreases as the number of branches increases. The trunk-branch structure enhances the time and space efficiency of the attention network, but at the expense of optimization performance. These results suggest that there are limits to increasing the number of branches in extremely high dimensions, which might imply the need for fundamental changes, such as adopting CNNs.

C.3 Boundary conditions

Figure 9: Boundary function experiments. Optimization results in 8192 dimensions after 100,000 function calls.

The ZDT test functions require boundary conditions on the search space X. To implement these boundary conditions, we attached an additional function to the end of the generator to enforce the boundaries:

x = (bound_max − bound_min) · (B(G(z)) + 1)/2 + bound_min

Boundaries can be implemented with functions such as B = tanh and B = sin, with the periodic boundary condition of the sin function showing slightly better results. This might be because the tanh function creates a bias at the edges: points that exceed the boundary concentrate at the edge, which in turn generates redundant data during critic network training. Hence, if a boundary is necessary, we recommend a periodic boundary condition.
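A minimal sketch of this boundary layer, assuming the generator's raw output is unbounded:

```python
import torch

def apply_boundary(g_out, bound_min, bound_max, periodic=True):
    """Squash raw generator output into [bound_min, bound_max] using either
    a periodic (sin) or saturating (tanh) boundary function B."""
    B = torch.sin if periodic else torch.tanh
    return (bound_max - bound_min) * (B(g_out) + 1.0) / 2.0 + bound_min
```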

C.4 Manifold issue

Figure 10: Black-box optimization of LeNet-5 trained on MNIST. Maximum score s = 1.0. Although optimization succeeds, it does not produce the intended smooth handwriting image.

The L-GSO research suggested that the surrogate could effectively discern manifold structures and that optimization performance would likely be better on problems with a manifold structure than on those without one. As GEO employs a similar surrogate neural network-based algorithm, the same behavior might be expected.

To visually verify this, we conducted an image generation experiment. We treated a simple LeNet-5 [44], trained on the MNIST dataset, as a black box and optimized its score. If GEO prefers manifold structures, images created through black-box optimization should exhibit a smooth shape (likely resembling actual hand-drawn digits).

However, the results show the contrary. Even though the optimization is successful, a smooth shape does not emerge; instead, it produces a noise image resembling an adversarial attack. This casts doubt on previous findings suggesting that GSN-based optimization is better suited to learning manifold structures.

Table 1: Optimization results for the Ackley function in low dimensions. 20,000 function calls, 10 repeats.
Ackley
Dimension     2                4                8                16
GEO           0.0000 ± 0.0000  0.0071 ± 0.0076  0.1009 ± 0.0432  0.3014 ± 0.2451
GEO 1-layer   0.0000 ± 0.0000  0.0030 ± 0.0023  0.8575 ± 0.7513  1.9020 ± 0.2054
GA            0.0001 ± 0.0002  0.0016 ± 0.0008  0.0073 ± 0.0033  0.0411 ± 0.0066
CMA-ES        0.0000 ± 0.0000  0.0000 ± 0.0000  0.0000 ± 0.0000  0.0000 ± 0.0000
LSM ε=1.0     0.0053 ± 0.0081  0.0261 ± 0.0118  0.1056 ± 0.0292  0.2036 ± 0.1139
LSM ε=0.2     0.0006 ± 0.0003  0.6754 ± 1.3312  0.0354 ± 0.0076  0.8967 ± 0.9222
Dimension     32               64               128
GEO           0.0694 ± 0.0576  0.0361 ± 0.0185  0.0296 ± 0.0136
GEO 1-layer   2.7931 ± 0.1655  3.4449 ± 0.1133  3.8488 ± 0.1002
GA            0.1132 ± 0.0163  0.2795 ± 0.0304  0.5510 ± 0.0369
CMA-ES        0.0000 ± 0.0000  0.0000 ± 0.0000  0.0001 ± 0.0000
LSM ε=1.0     0.2430 ± 0.1160  0.3432 ± 0.1225  0.8251 ± 0.2573
LSM ε=0.2     2.5080 ± 1.2227  3.4657 ± 0.3694  3.3817 ± 0.3004
Table 2: Optimization results for the Rosenbrock function in low dimensions. 20,000 function calls, 10 repeats.
Rosenbrock
Dimension     2                4                8                16
GEO           0.0000 ± 0.0000  0.0543 ± 0.1507  0.5090 ± 0.4545  0.5378 ± 0.5966
GEO 1-layer   0.0000 ± 0.0000  0.3062 ± 0.2330  0.8592 ± 0.2276  3.2036 ± 1.7442
GA            0.0001 ± 0.0001  0.0888 ± 0.0488  0.5197 ± 0.1102  0.8267 ± 0.0755
CMA-ES        0.0000 ± 0.0000  0.0000 ± 0.0000  0.0000 ± 0.0000  0.0000 ± 0.0000
LSM ε=1.0     0.3805 ± 0.3780  0.8514 ± 0.3439  1.3913 ± 0.1683  1.7246 ± 0.1924
LSM ε=0.2     0.5814 ± 0.3466  0.5992 ± 0.3083  0.6444 ± 0.3757  0.7882 ± 0.2399
Dimension     32                64                 128
GEO           0.1705 ± 0.2817   0.0564 ± 0.0975    0.0164 ± 0.0170
GEO 1-layer   11.3029 ± 1.3108  32.3401 ± 3.7944   61.4009 ± 4.9140
GA            1.7519 ± 0.5589   4.0404 ± 0.4313    5.9195 ± 0.2543
CMA-ES        0.6532 ± 0.0278   0.9068 ± 0.0150    0.9734 ± 0.0102
LSM ε=1.0     1.8820 ± 0.5490   2.1025 ± 0.7888    2.0884 ± 0.7405
LSM ε=0.2     1.5623 ± 0.7663   17.4755 ± 25.1455  38.9798 ± 23.4691
Table 3: Optimization results for the Rastrigin function in low dimensions. 20,000 function calls, 10 repeats.
Rastrigin
Dimension     2                4                8                16
GEO           0.0000 ± 0.0000  0.1027 ± 0.1635  0.1434 ± 0.2483  0.8459 ± 0.6891
GEO 1-layer   0.0000 ± 0.0000  0.3483 ± 0.1219  0.4758 ± 0.2170  1.2868 ± 0.3246
GA            0.0000 ± 0.0000  0.0000 ± 0.0000  0.0010 ± 0.0009  0.0127 ± 0.0048
CMA-ES        0.2985 ± 0.3300  0.4477 ± 0.2168  0.5721 ± 0.2021  0.4166 ± 0.1446
LSM ε=1.0     0.5375 ± 0.4406  3.6183 ± 1.9796  5.6259 ± 1.2943  5.5754 ± 1.3568
LSM ε=0.2     0.0000 ± 0.0000  0.5076 ± 0.4908  0.3248 ± 0.3222  0.4995 ± 0.2248
Dimension     32               64               128
GEO           1.8010 ± 1.2062  1.9690 ± 1.3727  0.8947 ± 1.1321
GEO 1-layer   2.8961 ± 0.3151  4.9785 ± 0.2506  6.4839 ± 0.2165
GA            0.1580 ± 0.0452  0.5013 ± 0.0478  0.9218 ± 0.0550
CMA-ES        0.5006 ± 0.1526  0.5208 ± 0.0961  0.6630 ± 0.0978
LSM ε=1.0     5.1430 ± 1.5115  7.8261 ± 0.8673  8.8664 ± 0.6537
LSM ε=0.2     0.6955 ± 0.3504  4.9844 ± 2.2791  7.5184 ± 1.6464
Table 4: Optimization results for the Styblinski-Tang function in low dimensions. 20,000 function calls, 10 repeats.
Styblinski-Tang
Dimension     2                 4                 8                 16
GEO           0.0000 ± 0.0000   0.0009 ± 0.0014   0.0023 ± 0.0017   0.0161 ± 0.0176
GEO 1-layer   2.1695 ± 3.2087   8.5127 ± 2.2338   15.9916 ± 1.1828  22.4661 ± 0.5406
GA            0.7068 ± 2.1205   0.7069 ± 1.4137   3.5346 ± 2.3707   5.2253 ± 0.9198
CMA-ES        7.0684 ± 5.4751   12.7231 ± 3.2391  10.9560 ± 2.5971  9.7190 ± 2.3708
LSM ε=1.0     26.2230 ± 1.9996  7.6978 ± 1.2559   6.4853 ± 3.9141   9.5768 ± 4.0347
LSM ε=0.2     27.6515 ± 2.8802  16.1667 ± 1.6806  12.1698 ± 2.0510  19.3622 ± 3.1020
Dimension     32                64                128
GEO           0.0064 ± 0.0092   1.4138 ± 4.2410   0.0000 ± 0.0000
GEO 1-layer   25.3427 ± 0.5938  27.5193 ± 0.6898  29.3574 ± 0.3288
GA            7.9346 ± 1.0712   11.2910 ± 0.5126  18.3960 ± 0.3724
CMA-ES        8.4820 ± 2.2159   9.9841 ± 0.6612   9.2883 ± 0.5663
LSM ε=1.0     9.7605 ± 4.3111   8.7600 ± 2.9911   17.5349 ± 4.3426
LSM ε=0.2     18.9152 ± 2.6356  32.2802 ± 2.4185  33.5116 ± 2.6515
Table 5: Optimization results for the test functions in high dimensions. 50,000 function calls, 10 repeats.
Ackley
Dimension  256              512              1024
GEO        0.0091 ± 0.0036  0.0117 ± 0.0037  0.0084 ± 0.0029
GA         0.3294 ± 0.0219  0.7342 ± 0.0273  1.4256 ± 0.0477
CMA-ES     0.0000 ± 0.0000  0.0003 ± 0.0000  0.0291 ± 0.0035
Rosenbrock
Dimension  256              512              1024
GEO        0.0006 ± 0.0006  0.0004 ± 0.0003  0.0005 ± 0.0004
GA         5.0726 ± 0.2486  5.9915 ± 0.2358  6.9435 ± 0.1485
CMA-ES     0.9742 ± 0.0062  1.0011 ± 0.0337  1.0292 ± 0.0301
Rastrigin
Dimension  256              512              1024
GEO        0.2034 ± 0.4046  0.0018 ± 0.0020  0.0034 ± 0.0057
GA         0.5636 ± 0.0317  0.9810 ± 0.3457  1.7443 ± 0.0638
CMA-ES     0.9573 ± 0.1040  1.3981 ± 0.1383  3.7305 ± 0.6978
Styblinski-Tang
Dimension  256               512               1024
GEO        0.0000 ± 0.0000   0.0000 ± 0.0000   0.0000 ± 0.0000
GA         14.4536 ± 0.3616  22.3592 ± 0.2035  28.8298 ± 0.1541
CMA-ES     9.6913 ± 0.4508   9.3711 ± 0.1878   9.2048 ± 0.2450