
An Actor-Critic Method for Simulation-Based Optimization

Kuo Li1, Qing-Shan Jia1, and Jiaqi Yan2

1K. Li and Q.-S. Jia are with the Center for Intelligent and Networked System (CFINS), Department of Automation, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing 100084, P.R. China (li-k19, [email protected]). 2Jiaqi Yan is with the Department of Computer Science, Tokyo Institute of Technology, 4259-J2-54, Nagatsuta-cho, Midori-ku, Yokohama 226-8502, Japan ([email protected]). This work was supported by the NSFC under Grant 62073182 and the 111 International Collaboration Project (BP2018006).
Abstract

We focus on a simulation-based optimization problem of choosing the best design from the feasible space. Although the simulation model can be queried with finite samples, its internal processing rule cannot be utilized in the optimization process. We formulate the sampling process as a policy-searching problem and give a solution from the perspective of Reinforcement Learning (RL). Concretely, the Actor-Critic (AC) framework is applied, where the critic serves as a surrogate model to predict the performance of unknown designs, whereas the actor encodes the sampling policy to be optimized. We design the updating rule and propose two algorithms for the cases where the feasible space is continuous and discrete, respectively. Experiments are designed to validate the effectiveness of the proposed algorithms, including two toy examples, which intuitively explain the algorithms, and two more complex tasks, i.e., an adversarial attack task and an RL task, which validate the effectiveness in large-scale problems. The results show that the proposed algorithms can successfully deal with these problems. In particular, in the RL task, our methods give a new perspective on robot control by treating the task as a simulation model and solving it by optimizing the policy-generating process, while existing works commonly optimize the policy itself directly.

I INTRODUCTION

Simulation-based optimization problems arise in various disciplines, such as computer science and economics. In general, a simulation-based optimization problem consists of maximizing (minimizing) a score (cost) function by choosing appropriate designs from an allowed set, and can typically be formulated as

$$\begin{aligned}
\operatorname*{argmax}_{x}\quad & f_{0}(x)\\
\text{s.t.}\quad & f_{i}(x)\leq 0,\ i=1,2,\cdots,m,
\end{aligned} \tag{1}$$

where $x$ is a design from the allowed set $X$ defined by the constraints $f_{i}(x)\leq 0,\ i=1,2,\cdots,m$, and $f_{0}(\cdot)$ is the objective function, which is usually computationally expensive and can be queried only a limited number of times. The design $x$ can be either a scalar or a vector depending on the problem.

There are various works dealing with different cases of (1). For example, convex optimization addresses the case where all of $f_{i}(\cdot),\ i=1,2,\cdots,m$ and $-f_{0}(\cdot)$ are convex ([1]), and deep learning handles the case where $f_{0}(\cdot)$ is totally or partially computed by Neural Networks (NNs). However, in both cases the operation rules are designed and utilized during the optimization process, e.g., convex optimization utilizes the first- or second-order derivatives of $f_{0}(\cdot)$, and deep learning requires the gradient with respect to all parameters of the NNs. These methods are therefore not suited for simulation-based optimization, where gradient information cannot be obtained explicitly.

In order to address the issues brought by such "incomplete" objective functions, researchers propose to "reconstruct" them with a controllable surrogate model, which is usually a predictive model fitted to the existing input-output pairs. Since a certain deviation between the surrogate model and the true simulation model $f_{0}(\cdot)$ cannot be avoided, related methods usually alternate between updating the surrogate model and updating the sampling distribution. For example, in Bayesian Optimization (BO, [2]), a Gaussian Process (GP, [3]) serves as the surrogate model, and the procedure alternates between proposing the best design via an acquisition function constructed from the surrogate model and updating the surrogate model with the latest input-output pair. BO has the advantage of high efficiency, but some technical difficulties remain; e.g., finding the design $x$ with the highest score on the acquisition function is non-trivial, especially in high-dimensional cases. Moreover, the acquisition function is a predictive model, which can only predict the performance of any $x\in\mathcal{X}$, rather than generate the best design $x^{*}$, which is essential in many real tasks, such as identifying the best parameters for controlling robots. In these cases, beyond the surrogate model, a generative model must be equipped to identify the best design $x^{*}$.
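To make the alternation described above concrete, the following is a minimal sketch of such a surrogate/acquisition loop, assuming a scikit-learn Gaussian Process surrogate and a simple upper-confidence-bound acquisition maximized over random candidates; the objective `f0`, the bounds, and the constant `kappa` are illustrative, not taken from this paper.

```python
# A minimal sketch of the surrogate/acquisition loop of BO (hedged, illustrative).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt_sketch(f0, dim=2, n_init=5, n_iter=30, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_init, dim))    # initial designs
    y = np.array([f0(x) for x in X])                  # expensive queries
    gp = GaussianProcessRegressor()
    for _ in range(n_iter):
        gp.fit(X, y)                                  # update the surrogate
        # Maximize a UCB acquisition over random candidates; this inner
        # maximization is exactly the non-trivial step noted above.
        cand = rng.uniform(-1.0, 1.0, size=(2048, dim))
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mu + kappa * sigma)]
        X = np.vstack([X, x_next])
        y = np.append(y, f0(x_next))                  # query the simulator
    return X[np.argmax(y)], y.max()
```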

The generative model generates a distribution over $\mathcal{X}$, which controls the sampling process. Therefore, adjusting the sampling distribution can be bridged to a stochastic control problem ([4]), which updates the sampling distribution based on previous information so as to reach a higher expected score on the simulation model $f_{0}(\cdot)$. The generative model is commonly used in AC-based RL, e.g., DDPG ([5]), TD3 ([6]) and SAC ([7]). It is also termed the "actor" in the AC framework, which generates a distribution over the action space (analogous to the design space in simulation-based optimization) for each system state (more details can be found in the papers mentioned above).

In this work, we also consider the simulation-based optimization problem, where the operation rule of the simulation model $f(\cdot)$ cannot be utilized in the optimization process. Similar to BO, a surrogate model is used to predict the scores of untested designs. To overcome the difficulty of finding the global optimum of the acquisition function, which is usually highly non-linear, we also introduce a generative model in our framework. Therefore, our methods inherit the AC framework from the RL area, and the framework here is especially similar to that of SAC, a state-of-the-art RL algorithm. The predictive model is termed the critic, which predicts $\hat{y}\approx f_{0}(x)$ for each $x\in\mathcal{X}$, and the generative model is termed the actor, which is updated to reach a higher expected score on $f_{0}(\cdot)$. Each iteration starts with the actor generating designs. These designs are scored by the simulation model, and the obtained input-output pairs are used to update the critic. The final step updates the actor with the latest critic.

The contributions of this paper are mainly as follows:

1) Two AC-based algorithms are designed to solve the simulation-based optimization problem, where the design space can be either continuous or discrete;

2) We take two high-dimensional tasks, adversarial attack and RL, as examples to show the application to real tasks, and the results show that the proposed algorithms can find satisfying solutions even in high-dimensional cases;

3) Compared with existing RL methods, e.g., Dynamic Programming (DP, [8]), Q-learning ([9]), DDPG, and SAC, which are motivated by optimizing the policy itself, the proposed algorithms offer a new perspective on robot control: reinforcing the policy by optimizing the policy-generating process. Besides, the induced methods can be deployed in an off-policy fashion and avoid the problem of sparse and delayed rewards.

II PRELIMINARY

Let us begin by introducing the problem of interest and the Actor-Critic (AC) framework extensively used in Reinforcement Learning (RL).

II-A Optimization Model

The simulation model $f(\cdot)$ takes a scalar (in the one-dimensional case) or a vector (in the multi-dimensional case) $x$ as input and outputs the corresponding score $y$. We aim to find the best design, i.e.,

$$\operatorname*{argmax}_{x\in\mathcal{X}}\; f(x), \tag{2}$$

where $\mathcal{X}$ is the allowed set.

In real implementations, $f(x)$ can serve as any objective function. For example, (2) becomes a convex optimization problem if $f(\cdot)$ is a concave function and $\mathcal{X}$ is a convex set. $f(\cdot)$ can also be a complex nonlinear function, e.g., a Neural Network (NN), in which case the problem becomes finding a design $x$ that receives the maximal response. As another example, in RL tasks, the agent usually intends to find a parameterized policy that gathers a higher accumulated reward by interacting with the environment. In this case, the environment is the simulation model $f(\cdot)$, and the policy is a design $x$; the task again takes the standard form of (2). In Section IV, we will demonstrate applications to these tasks.

II-B Actor-Critic Framework

The AC framework is widely used in the RL area and consists of two parts: the actor, which generates actions, and the critic, which evaluates each action. In RL, both the actor and the critic are typically constructed with NNs, where the actor network encodes a mapping from state $s$ to action $a$, and the critic network maps each $(s,a)$ pair to its predicted accumulated reward. Related work can be found in DDPG ([5]), TD3 ([6]) and SAC ([7]).

Typically, the critic is trained by minimizing the residual error of the Bellman equation:

$$L(\phi)=\frac{1}{2}\,\mathbb{E}\!\left[\left(Q^{\phi}(s,a)-\left(R(s,a)+\gamma\,\mathbb{E}\,Q^{\phi}(s',a')\right)\right)^{2}\right], \tag{3}$$

where $Q^{\phi}$ is the critic parameterized by $\phi$, $R(s,a)$ is the instantaneous reward, $s'$ is the succeeding state, and $a'$ is the next action drawn from the current policy. Besides, $\gamma\in[0,1)$ is the discount factor, which stabilizes the training process. $\phi$ can be updated by minimizing $L(\phi)$.
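As an illustration of how (3) is typically implemented, the following is a hedged PyTorch sketch of the Bellman-residual loss; the use of a separate target network and the names `q_net`, `q_target`, and `batch` are common-practice assumptions rather than details from this paper.

```python
# A sketch of the Bellman-residual critic loss in (3), under the stated assumptions.
import torch

def critic_loss(q_net, q_target, batch, gamma=0.99):
    s, a, r, s_next, a_next = batch                      # tensors sampled from a buffer
    with torch.no_grad():
        target = r + gamma * q_target(s_next, a_next)    # bootstrapped target in (3)
    td_error = q_net(s, a) - target
    return 0.5 * (td_error ** 2).mean()                  # L(phi)
```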

The actor is updated by maximizing the expected accumulated reward:

$$J(\theta)=\mathbb{E}_{s}\!\left[\mathbb{E}_{a\sim\pi^{\theta}(s)}\,Q^{\phi}(s,a)\right], \tag{4}$$

where $\pi^{\theta}$ is the current policy parameterized by $\theta$. Notice that in this work we choose a stochastic policy instead of a deterministic one for its excellent ability to balance exploitation and exploration.

III MAIN METHODS

Let us introduce the methodology for solving (2) within the AC framework. The cases where $\mathcal{X}$ is continuous and discrete will be considered separately.

Figure 1: Illustration of the network structures. Fig. 1(a) and Fig. 1(b) are the network structures of the actor for continuous and discrete design spaces, respectively. Fig. 1(c) is the network structure of the critic.

We use NNs to construct the actor and the critic (Fig. 1). The actor takes a random noise (or a constant) as input to capture multi-modal (or single-modal) distributions, and the critic takes the design $x$ as input and outputs a predicted score.
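For concreteness, a minimal PyTorch sketch of the three network structures in Fig. 1 is given below; the hidden width and depth are illustrative assumptions, not the architectures used in the experiments.

```python
# A minimal sketch of the networks in Fig. 1 (hidden sizes are assumptions).
import torch
import torch.nn as nn

class ContinuousActor(nn.Module):          # Fig. 1(a): noise -> (mu, sigma)
    def __init__(self, noise_dim, design_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(noise_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, design_dim)
        self.log_sigma = nn.Linear(hidden, design_dim)
    def forward(self, z):
        h = self.body(z)
        return self.mu(h), self.log_sigma(h).exp()

class DiscreteActor(nn.Module):            # Fig. 1(b): constant input -> one logit per design
    def __init__(self, n_designs, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_designs))
    def forward(self):
        return self.net(torch.ones(1, 1))  # logits over the finite design set

class Critic(nn.Module):                   # Fig. 1(c): design x -> predicted score
    def __init__(self, design_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(design_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)
```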

The actor and critic are updated iteratively. Each iteration starts with generating a feasible design $x$ and getting its score $y=f(x)$ by querying the simulation model $f(\cdot)$. The $(x,y)$ pair is stored in a replay buffer $\mathcal{D}$. Then a mini-batch $\mathcal{B}_{q}$ is sampled from $\mathcal{D}$ to update the critic by minimizing the loss function:

$$L(\phi)=\frac{1}{2}\sum_{(x,y)\in\mathcal{B}_{q}}\left(Q^{\phi}(x)-y\right)^{2}, \tag{5}$$

with one step of gradient descent:

$$\nabla_{\phi}L(\phi)=\sum_{(x,y)\in\mathcal{B}_{q}}\left(Q^{\phi}(x)-y\right)\nabla_{\phi}Q^{\phi}(x). \tag{6}$$
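A possible implementation of the critic update (5)-(6) is sketched below: one gradient step on the squared prediction error over a mini-batch drawn from the replay buffer. The optimizer choice and batch size are assumptions.

```python
# A sketch of the critic update in (5)-(6); autograd supplies the gradient (6).
import random
import torch

def update_critic(critic, optimizer, replay_buffer, batch_size=64):
    batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
    x = torch.stack([torch.as_tensor(xi, dtype=torch.float32) for xi, _ in batch])
    y = torch.tensor([yi for _, yi in batch], dtype=torch.float32)
    loss = 0.5 * ((critic(x) - y) ** 2).sum()   # L(phi) in (5)
    optimizer.zero_grad()
    loss.backward()                             # gradient (6) via autograd
    optimizer.step()
    return loss.item()
```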

Finally, the actor will be updated with the latest critic.

A stochastic actor is used to generate designs, and in each iteration the sampling distribution is updated to improve the expected score. In order to reduce the risk of falling into a local optimum, we use entropy regularization to achieve a trade-off between exploitation and exploration. The updating rule of the actor will be studied for continuous and discrete design spaces respectively.

III-A Continuous Design Space

We first introduce the case where $\mathcal{X}$ is continuous. For ease of illustration, this paper focuses on the bounded space $\mathcal{X}=\{x \mid \|x\|_{\infty}<1\}$. However, the results can be easily extended to any bounded continuous space through a linear mapping.

A squashed Gaussian policy (similar to SAC) is used to generate designs, as shown in Fig. 1(a). First, the actor network outputs two vectors, $\mu^{\theta}$ and $\sigma^{\theta}$, where $\theta$ is the parameter of the actor network. Then $x$ is generated by

$$x=g^{\theta}(\xi)=\tanh\!\left(\mu^{\theta}+\sigma^{\theta}\odot\xi\right), \tag{7}$$

where $\odot$ denotes element-wise multiplication. $\mu^{\theta}$, $\sigma^{\theta}$ and $\xi$ have the same dimension as $x$, and $\xi$ is a random noise drawn from the multi-dimensional standard Gaussian distribution $\xi\sim\mathcal{N}(0,I)$. The $\tanh$ function converts $\mathcal{N}(\mu^{\theta},\operatorname{diag}(\sigma^{\theta}))$ into another distribution whose value range is $(-1,1)$, and its probability density function can be calculated by

$$\hbar(x|\theta)=h(\xi|\theta)\left|\det\!\left(\frac{dg^{\theta}(\xi)}{d\xi}\right)\right|^{-1} \tag{8}$$

at $\xi=(g^{\theta})^{-1}(x)$, where $h(\cdot|\theta)$ is the probability density function of $\mathcal{N}(\mu^{\theta},\operatorname{diag}(\sigma^{\theta}))$, and $(g^{\theta})^{-1}(\cdot)$ is the inverse function of $g^{\theta}(\cdot)$.
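The squashed-Gaussian sampling (7) and the density correction (8) admit a compact implementation, sketched below using the standard tanh change-of-variables identity, where the Jacobian contributes a $\log(1-\tanh^{2})$ term per dimension; the small numerical epsilon is an implementation detail, not part of the derivation.

```python
# A sketch of sampling via (7) and evaluating log hbar(x|theta) via (8).
import torch

def sample_design(mu, sigma):
    xi = torch.randn_like(mu)                 # xi ~ N(0, I)
    pre_tanh = mu + sigma * xi
    x = torch.tanh(pre_tanh)                  # design in (-1, 1), eq. (7)
    # log hbar(x|theta) = Gaussian log-density - log|det dg/dxi|, eq. (8)
    gauss_logp = torch.distributions.Normal(mu, sigma).log_prob(pre_tanh).sum(-1)
    log_det = torch.log(1.0 - x.pow(2) + 1e-6).sum(-1)
    return x, gauss_logp - log_det
```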

The sampling distribution will be updated to reach a higher expected score

$$J(\theta)=\mathbb{E}_{x\sim\hbar(\cdot|\theta)}\,Q^{\phi}(x)+\alpha H\!\left(\hbar(\cdot|\theta)\right), \tag{9}$$

where $H(\hbar(\cdot|\theta))$ is the entropy regularization term for balancing exploitation and exploration, and can be calculated by

$$H\!\left(\hbar(\cdot|\theta)\right)=\mathbb{E}_{x\sim\hbar(\cdot|\theta)}\!\left[-\log\hbar(x|\theta)\right]. \tag{10}$$

Therefore, $J(\theta)$ can be rewritten as

$$\begin{aligned}
J(\theta)&=\mathbb{E}_{x\sim\hbar(\cdot|\theta)}\!\left[Q^{\phi}(x)-\alpha\log\hbar(x|\theta)\right]\\
&=\mathbb{E}_{\xi\sim\mathcal{N}(0,I)}\!\left[Q^{\phi}(g^{\theta}(\xi))-\alpha\log\hbar(g^{\theta}(\xi)|\theta)\right]
\end{aligned} \tag{11}$$

by substituting (10) into (9), and its gradient with respect to $\theta$ can be derived as:

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\xi\sim\mathcal{N}(0,I)}\!\left[\nabla_{g^{\theta}(\xi)}\!\left(Q^{\phi}(g^{\theta}(\xi))-\alpha\log\hbar(g^{\theta}(\xi)|\theta)\right)\nabla_{\theta}g^{\theta}(\xi)\right], \tag{12}$$

which can be estimated with a batch of samples $\mathcal{B}_{a}$ drawn from $\mathcal{N}(0,I)$:

$$\nabla_{\theta}J(\theta)=\frac{1}{N}\sum_{\xi\in\mathcal{B}_{a}}\left[\nabla_{g^{\theta}(\xi)}\!\left(Q^{\phi}(g^{\theta}(\xi))-\alpha\log\hbar(g^{\theta}(\xi)|\theta)\right)\nabla_{\theta}g^{\theta}(\xi)\right], \tag{13}$$

where $N$ is the size of $\mathcal{B}_{a}$. Finally, $\theta$ can be updated with (13) through one step of gradient ascent.

We summarize the optimization process in Algorithm 1. In each iteration, only a single design is evaluated on the simulation model, and the data pair is stored for future use. In the end, the best design in the replay buffer is chosen as the final result.

Algorithm 1 Optimization with continuous design space
1: for $episode = 1:M$ do
2:     The actor generates a design
$$x=g^{\theta}(\xi)=\tanh\!\left(\mu^{\theta}+\sigma^{\theta}\odot\xi\right),\quad \xi\sim\mathcal{N}(0,I). \tag{14}$$
3:     The simulation model gives the score $y=f(x)$.
4:     Store the $(x,y)$ pair in $\mathcal{D}$.
5:     for $round = 1:K$ do
6:         Sample a mini-batch $\mathcal{B}_{q}$ from $\mathcal{D}$.
7:         Update the critic with the gradient estimated by (6).
8:     for $round = 1:K$ do
9:         Sample a mini-batch $\mathcal{B}_{a}$ from $\mathcal{N}(0,I)$.
10:        Update the actor with the gradient estimated by (13).
11: return the best design stored in $\mathcal{D}$.
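A hedged end-to-end sketch of Algorithm 1 is given below. It reuses the `ContinuousActor`, `Critic`, `update_critic` and `sample_design` sketches introduced earlier; the simulator `f`, learning rates, batch sizes, and the values of $M$ and $K$ are illustrative assumptions, not the settings used in the experiments.

```python
# A sketch of Algorithm 1 (continuous design space), under the stated assumptions.
import torch

def run_algorithm1(f, noise_dim, design_dim, M=1000, K=10, alpha=1e-2):
    actor, critic = ContinuousActor(noise_dim, design_dim), Critic(design_dim)
    a_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
    c_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
    buffer = []
    for _ in range(M):
        with torch.no_grad():                              # line 2: generate one design
            mu, sigma = actor(torch.randn(1, noise_dim))
            x, _ = sample_design(mu, sigma)
        y = f(x.squeeze(0).numpy())                        # line 3: query the simulator
        buffer.append((x.squeeze(0).numpy(), y))           # line 4: store the pair
        for _ in range(K):                                 # lines 5-7: critic updates
            update_critic(critic, c_opt, buffer)
        for _ in range(K):                                 # lines 8-10: actor updates
            mu, sigma = actor(torch.randn(64, noise_dim))
            x_batch, logp = sample_design(mu, sigma)
            J = (critic(x_batch) - alpha * logp).mean()    # objective (11)
            a_opt.zero_grad(); (-J).backward(); a_opt.step()
    return max(buffer, key=lambda p: p[1])                 # line 11: best stored design
```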

III-B Discrete Design Space

Here we consider the case where there are only finitely many feasible designs in $\mathcal{X}$. In this case, the network structure of the actor is slightly modified to output a discrete distribution over the design space. As illustrated in Fig. 1(b), the number of output channels equals the size of $\mathcal{X}$, and each channel corresponds to a feasible design. After applying the softmax function, a distribution over $\mathcal{X}$ is obtained:

$$P^{\theta}(x_{i})=\frac{\exp\!\left(\mu^{\theta}(x_{i})\right)}{\sum_{x_{j}\in\mathcal{X}}\exp\!\left(\mu^{\theta}(x_{j})\right)},\quad \forall x_{i}\in\mathcal{X}, \tag{15}$$

where, by a slight abuse of notation, $\mu^{\theta}(x_{i})$ here denotes the output of the channel corresponding to $x_{i}$. Then we update the actor by maximizing the expected score:

$$\begin{aligned}
J(\theta)&=\mathbb{E}_{x\sim P^{\theta}}\,Q^{\phi}(x)+\alpha H\!\left(P^{\theta}(\cdot)\right)\\
&=\sum_{x\in\mathcal{X}}P^{\theta}(x)\left[Q^{\phi}(x)-\alpha\log P^{\theta}(x)\right],
\end{aligned} \tag{16}$$

where the entropy regularization is also introduced as in Section III-A. The gradient of $J$ with respect to $\theta$ can then be calculated by:

$$\nabla_{\theta}J(\theta)=\sum_{x\in\mathcal{X}}\nabla_{\theta}P^{\theta}(x)\left[Q^{\phi}(x)-\alpha\log P^{\theta}(x)-\alpha\right], \tag{17}$$

and the actor can be updated with one step of stochastic gradient ascent.
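The discrete actor update (15)-(17) can be sketched as follows, letting automatic differentiation supply the gradient (17); the names `logits`, `q_values`, `actor` and `optimizer` in the usage comment are illustrative assumptions.

```python
# A sketch of the entropy-regularized discrete actor objective (16).
import torch
import torch.nn.functional as F

def discrete_actor_objective(logits, q_values, alpha):
    p = F.softmax(logits, dim=-1)                     # P^theta over X, eq. (15)
    log_p = F.log_softmax(logits, dim=-1)
    return (p * (q_values - alpha * log_p)).sum()     # J(theta), eq. (16)

# One ascent step (assuming `actor()` returns logits and `optimizer` wraps its parameters):
# J = discrete_actor_objective(actor(), critic_scores, alpha)
# optimizer.zero_grad(); (-J).backward(); optimizer.step()
```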

Algorithm 2 Optimization with discrete design space
1: for $episode = 1:M$ do
2:     The actor generates a design $x\sim P^{\theta}(\cdot)$.
3:     The simulation model gives the score $y=f(x)$.
4:     Store the $(x,y)$ pair in $\mathcal{D}$.
5:     for $round = 1:K$ do
6:         Sample a mini-batch $\mathcal{B}_{q}$ from $\mathcal{D}$.
7:         Update the critic with the gradient estimated by (6):
$$\nabla_{\phi}L(\phi)=\sum_{(x,y)\in\mathcal{B}_{q}}\left(Q^{\phi}(x)-y\right)\nabla_{\phi}Q^{\phi}(x). \tag{18}$$
8:     for $round = 1:K$ do
9:         The actor generates the distribution $P^{\theta}(\cdot)$.
10:        The critic predicts the score $Q^{\phi}(\cdot)$ for each design.
11:        Update the actor with the gradient estimated by (17).
12: return the best design stored in $\mathcal{D}$.

We summarize the optimization process in Algorithm 2. It is worth noting that in each update of the actor, we attempt to maximize the objective function (16), whose optimal solution is the energy-based policy ([10]):

$$P^{*}(x_{i})=\frac{\exp\!\left(Q^{\phi}(x_{i})/\alpha\right)}{\sum_{x_{j}\in\mathcal{X}}\exp\!\left(Q^{\phi}(x_{j})/\alpha\right)},\quad \forall x_{i}\in\mathcal{X}. \tag{19}$$

A simple proof of (19) can be given by constructing the following optimization problem:

$$\begin{aligned}
\operatorname*{argmin}_{P(x)}\quad & -\sum_{x\in\mathcal{X}}P(x)\left[Q^{\phi}(x)-\alpha\log P(x)\right]\\
\text{s.t.}\quad & -P(x)<0,\ \forall x\in\mathcal{X}\\
& \sum_{x\in\mathcal{X}}P(x)-1=0.
\end{aligned} \tag{20}$$

It is easy to verify that (20) is a convex problem and satisfies the Slater condition ([1]), which guarantees that $P^{*}(x)$ is the unique solution of the KKT conditions:

$$\begin{aligned}
P^{*}(x)&>0, &&\forall x\in\mathcal{X} &&(21a)\\
\sum_{x\in\mathcal{X}}P^{*}(x)&=1 && &&(21b)\\
\lambda^{*}(x)&\geq 0, &&\forall x\in\mathcal{X} &&(21c)\\
\lambda^{*}(x)P^{*}(x)&=0, &&\forall x\in\mathcal{X} &&(21d)\\
-Q^{\phi}(x)+\alpha\log P^{*}(x)+\alpha-\lambda^{*}(x)+\nu^{*}&=0, &&\forall x\in\mathcal{X}, &&(21e)
\end{aligned}$$

where $\lambda^{*}$ and $\nu^{*}$ are the optimal values of the dual variables. Combining (21a), (21c) and (21d), we get $\lambda^{*}(x)=0,\ \forall x\in\mathcal{X}$. Then, substituting this into (21e), we get

$$\log P^{*}(x)=\frac{Q^{\phi}(x)}{\alpha}+c,\quad \forall x\in\mathcal{X}, \tag{22}$$

where $c=-1-\frac{\nu^{*}}{\alpha}$. Substituting (22) into (21b), we have

$$c=-\log\sum_{x\in\mathcal{X}}\exp\!\left(\frac{Q^{\phi}(x)}{\alpha}\right). \tag{23}$$

Finally, combining (22) and (23), the optimal solution (19) is obtained. In Section IV, we will show that the learned distribution is very close to the optimal distribution $P^{*}$.
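For reference, (19) is simply a softmax of the predicted scores scaled by $1/\alpha$, as the following small sketch shows; `q_values` stands for the critic's predictions over the finite design set.

```python
# A small sketch of the closed-form optimal distribution (19).
import torch
import torch.nn.functional as F

def optimal_distribution(q_values, alpha):
    return F.softmax(q_values / alpha, dim=-1)        # P* in (19)

# Plugging P* back into (16) gives J = alpha * logsumexp(Q/alpha), the maximal value.
```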

So far, we have introduced methods to solve the simulation-based optimization problem (2) within the AC framework, and proposed algorithms for continuous and discrete design spaces respectively. In the next section, we design experiments to validate them.

IV APPLICATIONS

In this section, we will start from a toy example, with which some intuitive interpretation of Algorithms 1 and 2 will be given. Then we design more complex tasks, i.e., adversarial attack and reinforcement learning, to show the effectiveness of the proposed algorithms in large-scale systems.

Firstly, we choose the probability density function of a Gaussian Mixture Model (GMM) as the toy example:

$$\begin{aligned}
f(x)=\;& w_{1}\times\frac{1}{\sqrt{2\pi}\sigma_{1}}\exp\!\left(-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}\right)\\
&+w_{2}\times\frac{1}{\sqrt{2\pi}\sigma_{2}}\exp\!\left(-\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\right),\quad x\in(-1,1),
\end{aligned} \tag{24}$$

and the parameters are set as $w_{1}=0.51$, $w_{2}=0.49$, $\mu_{1}=-0.7$, $\mu_{2}=0.7$, $\sigma_{1}=0.6$, $\sigma_{2}=0.6$.
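For reproducibility of the toy example, the objective (24) can be written as a black-box callable as sketched below, with the parameter values listed above.

```python
# The toy objective (24) as a black-box callable (parameters as stated above).
import numpy as np

def f_gmm(x, w=(0.51, 0.49), mu=(-0.7, 0.7), sigma=(0.6, 0.6)):
    x = np.asarray(x, dtype=float)
    return sum(wk / (np.sqrt(2 * np.pi) * sk) * np.exp(-(x - mk) ** 2 / (2 * sk ** 2))
               for wk, mk, sk in zip(w, mu, sigma))
```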

Figure 2: Results of Algorithm 1 for $\alpha=10^{-1}$ (Fig. 2(a)), $\alpha=10^{-2}$ (Fig. 2(b)) and $\alpha=10^{-3}$ (Fig. 2(c)). In each figure, the blue line is the ground truth of the simulation model $f(\cdot)$, the orange line is the value predicted by the critic $Q^{\phi}(\cdot)$, and the green line is the sampling distribution $\hbar(\cdot|\theta)$ generated by the actor. The curves are normalized to $[0,1]$ for convenience of comparison. In order to show the effect of the entropy regularization, we replace the stochastic actor with a deterministic one, and the result is shown in Fig. 2(d).

We launch Algorithm 1 and the results are shown in Fig. 2. It can be found that the sampling distribution gradually converges to the optimal area as $\alpha$ decreases, which means that by setting an appropriate $\alpha$ to control the strength of the entropy regularization, satisfying designs can be drawn from the final actor. To analyze the effect of entropy regularization, $\alpha$ is set to $10^{-1}$, $10^{-2}$ and $10^{-3}$ respectively. Compared with the smaller $\alpha$ in Fig. 2(b) and Fig. 2(c), the actor in Fig. 2(a) generates an almost uniform distribution, which means that the entropy regularization has a greater impact on the objective in (9). Besides, since the samples are distributed more evenly over the allowed set, the predicted scores are close to the ground truth almost everywhere. As $\alpha$ decreases, the sampling distribution narrows toward the optimal area (Fig. 2(b) and Fig. 2(c)), which means that the actor can generate satisfying designs. However, the samples concentrate in some areas, which leads to worse performance of the critic in other areas, as illustrated in Fig. 2(c). We also test the performance of a deterministic actor, as in DDPG ([5]), where the actor only generates a deterministic design rather than a distribution over the design space. The final design falls into a local optimum more easily, as illustrated in Fig. 2(d). In real implementations, we usually set a larger $\alpha$ in the initial phase and decrease it step by step. In this way, the sampling distribution can converge to a neighborhood of the global optimum while avoiding local optima.

Figure 3: Results of Algorithm 2 for $\alpha=10^{-1}$ (Fig. 3(a)), $\alpha=10^{-2}$ (Fig. 3(b)) and $\alpha=10^{-3}$ (Fig. 3(c) and Fig. 3(d)). As in Fig. 2, the blue line is the ground truth of the simulation model $f(\cdot)$ and the orange line is the value predicted by the critic $Q^{\phi}(\cdot)$. Histograms illustrate the sampling distribution over the discrete $\mathcal{X}$: the blue bars denote the sampling distribution $P^{\theta}(\cdot)$ generated by the actor and the orange bars denote the optimal distribution $P^{*}(\cdot)$.

The simulation model (24) is modified to test Algorithm 2 by converting the design space $\mathcal{X}$ into a discrete set. The results are shown in Fig. 3. It can be found that by decreasing the strength of the entropy regularization, the sampling distribution gradually converges to the optimal area. We also compare the effect of the entropy regularization by setting $\alpha$ to $10^{-1}$, $10^{-2}$ and $10^{-3}$. When a larger $\alpha$ is chosen, e.g., $\alpha=10^{-1}$ in Fig. 3(a), the actor generates a nearly uniform distribution. When it is decreased, the learned distribution gradually converges to designs with higher scores, as in Fig. 3(b) and Fig. 3(c). However, if $\alpha$ is set too small, the actor may fall into a local optimum, as illustrated in Fig. 3(d). Therefore, in real implementations we also set a larger $\alpha$ in the initial phase and decrease it step by step, as in the continuous case. Furthermore, it can be found that the distribution $P^{\theta}(\cdot)$ generated by the actor is very close to the optimal one $P^{*}(\cdot)$, which validates the theoretical analysis.

After the toy examples, we design more complex tasks to validate the proposed algorithms.

The first task is abstracted from adversarial attack tasks, in which we need to generate samples to "cheat" NNs that are trained to recognize the objects in input pictures. In this experiment, we train a classifier to recognize handwritten digits on the public MNIST data set. The classifier $D(\cdot)$ takes an image $x'$ as input and generates a distribution over $Y=\{0,1,2,\cdots,9\}$, which denotes the confidence of classifying the digit image as the corresponding value. For convenience of later representation, the distribution is denoted as $D(x')\in R^{10}$, and $D(x',i),\ i\in Y$ denotes the confidence of classifying the image $x'\in R^{W\times H}$ as $i$, where, by a slight abuse of notation, $W$ and $H$ are the width and height of the images. In adversarial attack tasks, such fake images are generated to enhance the data set and train a more robust classifier.

Figure 4: Learning results. (a) Learning curve of the training process, drawn by averaging the confidences on 300 samples generated by the actor after each episode. (b) Image generated by the actor after 1000 episodes.

In order to generate the fake images, we construct the following simulation model:

$$f(x)=D\!\left(r\!\left(\frac{x+1}{2}\right),\,i\right), \tag{25}$$

where $x\in R^{W\times H}$ and $r(\cdot)$ is the resizing operator converting the input vector to the appropriate shape. The linear mapping $\frac{x+1}{2}$ converts the value range from $(-1,1)$ to $(0,1)$, the same as the images in the data set. Then, by solving the optimization problem (2), the actor can generate fake images that mislead the classifier into recognizing them as $i$. We choose $i=1$ in our experiments. The learning curve is illustrated in Fig. 4(a), where we can find that within 1000 episodes (only one sample is evaluated by $f(\cdot)$ in each episode), the average confidence reaches over 0.95. Before the 500-th episode, the average confidence is close to 0, which may be caused by the low accuracy of the critic. After the 500-th episode, the score increases gradually, which means both the actor and the critic are improved. One of the images generated by the final actor is shown in Fig. 4(b). It is just noise from the viewpoint of a human, but the classifier tends to predict it as the digit "1".
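A sketch of how (25) might be wrapped as a black-box objective is given below; the `classifier` is assumed to be a trained torch model returning class probabilities, and `W`, `H` and the target class are illustrative assumptions.

```python
# A sketch of the attack objective (25), under the stated assumptions.
import torch

def make_attack_objective(classifier, target=1, W=28, H=28):
    def f(x):                                     # x is a flat vector in (-1, 1)
        img = (torch.as_tensor(x, dtype=torch.float32) + 1.0) / 2.0
        img = img.reshape(1, 1, W, H)             # the resize operator r(.)
        with torch.no_grad():
            probs = classifier(img)               # D(.) over the 10 digit classes
        return probs[0, target].item()            # confidence D(., i)
    return f
```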

Another type of fake image is also of interest: images that retain the features of the initial digit but mislead the classifier into making wrong decisions through the addition of special noise. In order to generate these images, we construct the following simulation model:

$$f(x)=D\!\left(c\!\left(r\!\left(\frac{x+1}{2}\right)\times\delta+x_{j}\right),\,i\right),\quad i\neq j, \tag{26}$$

where the operator $c$ clips the input variables to $[0,1]$, and $\delta$ is a small constant controlling the noise level, which is set to 0.2 in our experiment. Besides, $x_{j}$ is an image from the initial data set whose label is $j$. In this case, the simulation-based optimization problem (2) aims to find special noise whose addition makes the classifier wrongly classify the image from $j$ to $i$.

Figure 5: Result of adding special noise to the initial image. The orange curve is the confidence that the classifier recognizes the noisy image as "6", and the blue curve is the confidence of recognizing it as "1". The three images in the lower row are the initial image mixed with the noise generated at episodes 0, 1500 and 3000, respectively.

The result is shown in Fig. 5. After about 2000 episodes, the actor can generate noise "special" enough to mislead the classifier. We present the noisy images at episodes 0, 1500 and 3000. All of these images retain the features of "6" from the viewpoint of a human, but the classifier is "cheated" into making wrong decisions.

Figure 6: CartPole-v0 environment (Fig. 6(a)) and the training result (Fig. 6(b)). The CartPole-v0 environment takes 0 or 1 as action at each time step, which drives the cart toward the left or right respectively. At most 200 steps are executed in each episode, and an early stop is triggered if the pole falls down. Each step is rewarded by +1. Fig. 6(b) shows the learning curve of Algorithm 1, drawn by averaging the return over 300 independent experiments.

Finally, we apply Algorithm 1 to a more complex task, i.e., an RL task. In typical policy-based and AC-based RL algorithms, the policy is parameterized by $\zeta$ and generates an action for each state. In existing RL methods, the parameter $\zeta$ is adjusted by Policy Gradient (PG) algorithms, as in TRPO ([11]) and PPO ([12]), or by Deterministic Policy Gradient (DPG) algorithms, as in DDPG ([5]) and TD3 ([6]), to reach a higher accumulated reward. In this work, we instead treat the task directly as a simulation model, where $\zeta$ is a concrete design. We take the "CartPole-v0" environment ([13]) in Fig. 6(a) as an example and deploy Algorithm 1 on it. The result is shown in Fig. 6(b), where we can find that the policy (the generated design) is successfully reinforced to reach a higher score. Although existing RL methods achieve better performance on such tasks, our method offers a new perspective on robot control, which may help with the sparse and delayed reward problem widely present in RL.
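As a sketch of this viewpoint, the following treats CartPole-v0 as the simulation model $f$: the design $\zeta$ parameterizes a simple policy and $f(\zeta)$ is the episode return. It assumes the classic Gym API (pre-0.26), and the linear policy form is an illustrative choice, not the parameterization used in the experiments.

```python
# A hedged sketch of f(zeta) for CartPole-v0: return of a linear policy.
import gym
import numpy as np

def cartpole_return(zeta, episodes=1):
    env = gym.make("CartPole-v0")
    w = np.asarray(zeta, dtype=float).reshape(4)   # one weight per state dimension
    total = 0.0
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            action = int(np.dot(w, obs) > 0)       # 0: push left, 1: push right
            obs, reward, done, _ = env.step(action)
            total += reward
    env.close()
    return total / episodes                        # f(zeta): average return
```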

We design these examples to show the applications of the proposed methods, as well as to validate their performance on large-scale problems. The results are further summarized in the next section.

V CONCLUSIONS

In this work, we consider the simulation-based optimization problem of selecting the best design with a computationally expensive evaluation model. The sampling process is bridged to a policy optimization problem and solved under the AC framework, where the critic serves as the surrogate model predicting the scores of untested designs, and the actor encodes the sampling distribution. We propose algorithms to update the actor and critic for continuous and discrete design spaces respectively, and design experiments to validate their effectiveness as well as illustrate their applications. The results show that our methods can successfully find satisfying designs and, to a certain extent, avoid falling into local optima. We note that in the experiment on the RL task, we offer a new perspective on robot control: optimizing the policy-generating process rather than the policy itself. This may help address the problem of sparse and delayed rewards in existing works, and this perspective will be studied in future work. Another direction for future research is adjusting the hyper-parameter $\alpha$ automatically, as in [14].

References

  • [1] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex optimization.   Cambridge university press, 2004.
  • [2] M. Pelikan, D. E. Goldberg, E. Cantú-Paz, et al., "BOA: The Bayesian optimization algorithm," in Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, vol. 1.   Citeseer, 1999, pp. 525–532.
  • [3] R. M. Dudley, "Sample functions of the Gaussian process," Selected Works of RM Dudley, pp. 187–224, 2010.
  • [4] Y. Peng, E. K. Chong, C.-H. Chen, and M. C. Fu, “Ranking and selection as stochastic control,” IEEE Transactions on Automatic Control, vol. 63, no. 8, pp. 2359–2373, 2018.
  • [5] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [6] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1587–1596.
  • [7] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1861–1870.
  • [8] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37, 1966.
  • [9] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [10] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energy-based policies,” in International Conference on Machine Learning.   PMLR, 2017, pp. 1352–1361.
  • [11] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning.   PMLR, 2015, pp. 1889–1897.
  • [12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [13] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," 2016.
  • [14] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al., “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.