
An Actor-Critic Method for Simulation-Based Optimization

Kuo Li1, Qing-Shan Jia1, and Jiaqi Yan2

1K. Li and Q.-S. Jia are with the Center for Intelligent and Networked System (CFINS), Department of Automation, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing 100084, P.R. China (li-k19, [email protected]). 2Jiaqi Yan is with the Department of Computer Science, Tokyo Institute of Technology, 4259-J2-54, Nagatsuta-cho, Midori-ku, Yokohama 226-8502, Japan ([email protected]). This work was supported by the NSFC under Grant 62073182 and the 111 International Collaboration Project (BP2018006).
Abstract

We focus on a simulation-based optimization problem of choosing the best design from the feasible space. Although the simulation model can be queried with finite samples, its internal processing rule cannot be utilized in the optimization process. We formulate the sampling process as a policy-searching problem and give a solution from the perspective of Reinforcement Learning (RL). Concretely, the Actor-Critic (AC) framework is applied, where the critic serves as a surrogate model to predict the performance of unknown designs, whereas the actor encodes the sampling policy to be optimized. We design the updating rule and propose two algorithms for the cases where the feasible space is continuous and discrete, respectively. Experiments are designed to validate the effectiveness of the proposed algorithms, including two toy examples, which intuitively explain the algorithms, and two more complex tasks, i.e., an adversarial attack task and an RL task, which validate the effectiveness in large-scale problems. The results show that the proposed algorithms can successfully deal with these problems. In particular, in the RL task, our methods give a new perspective on robot control by treating the task as a simulation model and solving it by optimizing the policy-generating process, while existing works commonly optimize the policy itself directly.

I INTRODUCTION

Simulation-based optimization problems arise in various disciplines, such as computer science and economics. In general, a simulation-based optimization problem consists of maximizing (minimizing) a score (cost) function by choosing appropriate designs from an allowed set, and can typically be formulated as

$$\begin{aligned}
\operatorname*{argmax}_{x}\quad & f_{0}(x)\\
\text{s.t.}\quad & f_{i}(x)\leq 0,\ i=1,2,\cdots,m,
\end{aligned} \tag{1}$$

where $x$ is a design from the allowed set $X$ defined by the constraints $f_{i}(x)\leq 0,\ i=1,2,\cdots,m$, and $f_{0}(\cdot)$ is the objective function, which is usually computationally expensive and can be queried only a limited number of times. The design $x$ can be either a scalar or a vector depending on the problem.

There are various works dealing with different cases of (1). For example, convex optimization addresses the case where all of $f_{i}(\cdot),\ i=1,2,\cdots,m$ and $-f_{0}(\cdot)$ are convex ([1]), and deep learning handles the case where $f_{0}(\cdot)$ is totally or partially computed by Neural Networks (NNs). However, in both cases the operation rules are designed and utilized during the optimization process, e.g., convex optimization utilizes the first- or second-order derivatives of $f_{0}(\cdot)$, and deep learning requires the gradient with respect to all parameters of the NNs. These methods are therefore not suited for simulation-based optimization, where gradient information cannot be obtained explicitly.

In order to address the issues brought by such "incomplete" objective functions, researchers propose to "reconstruct" them with a controllable surrogate model, which is usually a predictive model fitted to the existing input-output pairs. Since a certain deviation between the surrogate model and the true simulation model $f_{0}(\cdot)$ cannot be avoided, related methods usually alternate between updating the surrogate model and updating the sampling distribution. For example, in Bayesian Optimization (BO, [2]), a Gaussian Process (GP, [3]) serves as the surrogate model, and the procedure alternates between proposing the best design via an acquisition function constructed from the surrogate model and updating the surrogate model with the latest input-output pair. BO has the advantage of high efficiency, but some technical difficulties remain; e.g., finding the design $x$ with the highest score on the acquisition function is non-trivial, especially in high-dimensional cases. Moreover, the acquisition function is a predictive model, which can only predict the performance of any $x\in\mathcal{X}$, rather than generate the best design $x^{*}$, which is essential in many real tasks, such as identifying the best parameters for controlling robots. In these cases, beyond the surrogate model, a generative model must be equipped to identify the best design $x^{*}$.
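To make the alternation described above concrete, the following is a minimal sketch of such a surrogate/acquisition loop, assuming a scikit-learn Gaussian Process surrogate and a simple upper-confidence-bound acquisition maximized over random candidates; the objective `f0`, the bounds, and the constant `kappa` are illustrative, not taken from this paper.

```python
# A minimal sketch of the surrogate/acquisition loop of BO (hedged, illustrative).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt_sketch(f0, dim=2, n_init=5, n_iter=30, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_init, dim))    # initial designs
    y = np.array([f0(x) for x in X])                  # expensive queries
    gp = GaussianProcessRegressor()
    for _ in range(n_iter):
        gp.fit(X, y)                                  # update the surrogate
        # Maximize a UCB acquisition over random candidates; this inner
        # maximization is exactly the non-trivial step noted above.
        cand = rng.uniform(-1.0, 1.0, size=(2048, dim))
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mu + kappa * sigma)]
        X = np.vstack([X, x_next])
        y = np.append(y, f0(x_next))                  # query the simulator
    return X[np.argmax(y)], y.max()
```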

The generative model generates a distribution over $\mathcal{X}$, which controls the sampling process. Therefore, adjusting the sampling distribution can be bridged to a stochastic control problem ([4]), which updates the sampling distribution based on previous information so as to reach a higher expected score on the simulation model $f_{0}(\cdot)$. The generative model is commonly used in AC-based RL, e.g., DDPG ([5]), TD3 ([6]) and SAC ([7]). It is also termed the "actor" in the AC framework, which generates a distribution over the action space (analogous to the design space in simulation-based optimization) for each system state (more details can be found in the papers mentioned above).

In this work, we also consider the simulation-based optimization problem, where the operation rule of the simulation model $f(\cdot)$ cannot be utilized in the optimization process. Similar to BO, a surrogate model is used to predict the scores of untested designs. To overcome the difficulty of finding the global optimum of the acquisition function, which is usually highly non-linear, we also introduce a generative model in our framework. Therefore, our methods inherit the AC framework from the RL area, and the framework here is especially similar to that of SAC, a state-of-the-art RL algorithm. The predictive model is termed the critic, which predicts $\hat{y}\approx f_{0}(x)$ for each $x\in\mathcal{X}$, and the generative model is termed the actor, which is updated to reach a higher expected score on $f_{0}(\cdot)$. Each iteration starts with the actor generating designs. These designs are scored by the simulation model, and the obtained input-output pairs are used to update the critic. The final step updates the actor with the latest critic.

The contributions of this paper are mainly as follows:

1) Two AC-based algorithms are designed to solve the simulation-based optimization problem, where the design space can be either continuous or discrete;

2) We take two high-dimensional tasks, adversarial attack and RL, as examples to show the application to real tasks, and the results show that the proposed algorithms can find satisfying solutions even in high-dimensional cases;

3) Compared with existing RL methods, e.g., Dynamic Programming (DP, [8]), Q-learning ([9]), DDPG, and SAC, which are motivated by optimizing the policy itself, the proposed algorithms offer a new perspective on robot control: reinforcing the policy by optimizing the policy-generating process. Besides, the induced methods can be deployed in an off-policy fashion and avoid the problem of sparse and delayed rewards.

II PRELIMINARY

Let us begin by introducing the problem of interest and the Actor-Critic (AC) framework extensively used in Reinforcement Learning (RL).

II-A Optimization Model

The simulation model $f(\cdot)$ takes a scalar (in the one-dimensional case) or a vector (in the multi-dimensional case) $x$ as input and outputs the corresponding score $y$. We aim to find the best design, i.e.,

$$\operatorname*{argmax}_{x\in\mathcal{X}}\; f(x), \tag{2}$$

where $\mathcal{X}$ is the allowed set.

In real implementations, $f(x)$ can serve as any objective function. For example, (2) becomes a convex optimization problem if $f(\cdot)$ is a concave function and $\mathcal{X}$ is a convex set. $f(\cdot)$ can also be a complex nonlinear function, e.g., a Neural Network (NN), in which case the problem becomes finding a design $x$ that receives the maximal response. As another example, in RL tasks, the agent usually intends to find a parameterized policy that gathers a higher accumulated reward by interacting with the environment. In this case, the environment is the simulation model $f(\cdot)$, and the policy is a design $x$; the task again takes the standard form of (2). In Section IV, we will demonstrate applications to these tasks.

II-B Actor-Critic Framework

The AC framework is widely used in the RL area and consists of two parts: the actor, which generates actions, and the critic, which evaluates each action. In RL, both the actor and the critic are typically constructed with NNs, where the actor network encodes a mapping from state $s$ to action $a$, and the critic network maps each $(s,a)$ pair to its predicted accumulated reward. Related work can be found in DDPG ([5]), TD3 ([6]) and SAC ([7]).

Typically, the critic is trained by minimizing the residual error of the Bellman equation:

$$L(\phi)=\frac{1}{2}\,\mathbb{E}\!\left[\left(Q^{\phi}(s,a)-\left(R(s,a)+\gamma\,\mathbb{E}\,Q^{\phi}(s',a')\right)\right)^{2}\right], \tag{3}$$

where $Q^{\phi}$ is the critic parameterized by $\phi$, $R(s,a)$ is the instantaneous reward, $s'$ is the succeeding state, and $a'$ is the next action drawn from the current policy. Besides, $\gamma\in[0,1)$ is the discount factor, which stabilizes the training process. $\phi$ can be updated by minimizing $L(\phi)$.
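As an illustration of how (3) is typically implemented, the following is a hedged PyTorch sketch of the Bellman-residual loss; the use of a separate target network and the names `q_net`, `q_target`, and `batch` are common-practice assumptions rather than details from this paper.

```python
# A sketch of the Bellman-residual critic loss in (3), under the stated assumptions.
import torch

def critic_loss(q_net, q_target, batch, gamma=0.99):
    s, a, r, s_next, a_next = batch                      # tensors sampled from a buffer
    with torch.no_grad():
        target = r + gamma * q_target(s_next, a_next)    # bootstrapped target in (3)
    td_error = q_net(s, a) - target
    return 0.5 * (td_error ** 2).mean()                  # L(phi)
```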

The actor is updated by maximizing the expected accumulated reward:

$$J(\theta)=\mathbb{E}_{s}\!\left[\mathbb{E}_{a\sim\pi^{\theta}(s)}\,Q^{\phi}(s,a)\right], \tag{4}$$

where $\pi^{\theta}$ is the current policy parameterized by $\theta$. Notice that in this work we choose a stochastic policy instead of a deterministic one for its excellent ability to balance exploitation and exploration.

III MAIN METHODS

Let us introduce the methodology for solving (2) within the AC framework. The cases where $\mathcal{X}$ is continuous and discrete will be considered separately.

Figure 1: Illustration of the network structures. Fig. 1(a) and Fig. 1(b) are the network structures of the actor for continuous and discrete design spaces, respectively. Fig. 1(c) is the network structure of the critic.

We use NNs to construct the actor and the critic (Fig. 1). The actor takes a random noise (or a constant) as input to capture multi-modal (or single-modal) distributions, and the critic takes the design $x$ as input and outputs a predicted score.
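For concreteness, a minimal PyTorch sketch of the three network structures in Fig. 1 is given below; the hidden width and depth are illustrative assumptions, not the architectures used in the experiments.

```python
# A minimal sketch of the networks in Fig. 1 (hidden sizes are assumptions).
import torch
import torch.nn as nn

class ContinuousActor(nn.Module):          # Fig. 1(a): noise -> (mu, sigma)
    def __init__(self, noise_dim, design_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(noise_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, design_dim)
        self.log_sigma = nn.Linear(hidden, design_dim)
    def forward(self, z):
        h = self.body(z)
        return self.mu(h), self.log_sigma(h).exp()

class DiscreteActor(nn.Module):            # Fig. 1(b): constant input -> one logit per design
    def __init__(self, n_designs, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_designs))
    def forward(self):
        return self.net(torch.ones(1, 1))  # logits over the finite design set

class Critic(nn.Module):                   # Fig. 1(c): design x -> predicted score
    def __init__(self, design_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(design_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)
```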

The actor and critic are updated iteratively. Each iteration starts with generating a feasible design $x$ and getting its score $y=f(x)$ by querying the simulation model $f(\cdot)$. The $(x,y)$ pair is stored in a replay buffer $\mathcal{D}$. Then a mini-batch $\mathcal{B}_{q}$ is sampled from $\mathcal{D}$ to update the critic by minimizing the loss function:

$$L(\phi)=\frac{1}{2}\sum_{(x,y)\in\mathcal{B}_{q}}\left(Q^{\phi}(x)-y\right)^{2}, \tag{5}$$

with one step of gradient descent:

$$\nabla_{\phi}L(\phi)=\sum_{(x,y)\in\mathcal{B}_{q}}\left(Q^{\phi}(x)-y\right)\nabla_{\phi}Q^{\phi}(x). \tag{6}$$
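A possible implementation of the critic update (5)-(6) is sketched below: one gradient step on the squared prediction error over a mini-batch drawn from the replay buffer. The optimizer choice and batch size are assumptions.

```python
# A sketch of the critic update in (5)-(6); autograd supplies the gradient (6).
import random
import torch

def update_critic(critic, optimizer, replay_buffer, batch_size=64):
    batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
    x = torch.stack([torch.as_tensor(xi, dtype=torch.float32) for xi, _ in batch])
    y = torch.tensor([yi for _, yi in batch], dtype=torch.float32)
    loss = 0.5 * ((critic(x) - y) ** 2).sum()   # L(phi) in (5)
    optimizer.zero_grad()
    loss.backward()                             # gradient (6) via autograd
    optimizer.step()
    return loss.item()
```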

Finally, the actor will be updated with the latest critic.

A stochastic actor is used to generate designs, and in each iteration the sampling distribution is updated to improve the expected score. In order to reduce the risk of falling into a local optimum, we use entropy regularization to achieve a trade-off between exploitation and exploration. The updating rule of the actor will be studied for continuous and discrete design spaces respectively.

III-A Continuous Design Space

We first introduce the case where $\mathcal{X}$ is continuous. For ease of illustration, this paper focuses on the bounded space $\mathcal{X}=\{x \mid \|x\|_{\infty}<1\}$. However, the results can be easily extended to any bounded continuous space through a linear mapping.

A squashed Gaussian policy (similar to SAC) is used to generate designs, as shown in Fig. 1(a). First, the actor network outputs two vectors, $\mu^{\theta}$ and $\sigma^{\theta}$, where $\theta$ is the parameter of the actor network. Then $x$ is generated by

$$x=g^{\theta}(\xi)=\tanh\!\left(\mu^{\theta}+\sigma^{\theta}\odot\xi\right), \tag{7}$$

where $\odot$ denotes element-wise multiplication. $\mu^{\theta}$, $\sigma^{\theta}$ and $\xi$ have the same dimension as $x$, and $\xi$ is a random noise drawn from the multi-dimensional standard Gaussian distribution $\xi\sim\mathcal{N}(0,I)$. The $\tanh$ function converts $\mathcal{N}(\mu^{\theta},\operatorname{diag}(\sigma^{\theta}))$ into another distribution whose value range is $(-1,1)$, and its probability density function can be calculated by

$$\hbar(x|\theta)=h(\xi|\theta)\left|\det\!\left(\frac{dg^{\theta}(\xi)}{d\xi}\right)\right|^{-1} \tag{8}$$

at $\xi=(g^{\theta})^{-1}(x)$, where $h(\cdot|\theta)$ is the probability density function of $\mathcal{N}(\mu^{\theta},\operatorname{diag}(\sigma^{\theta}))$, and $(g^{\theta})^{-1}(\cdot)$ is the inverse function of $g^{\theta}(\cdot)$.
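The squashed-Gaussian sampling (7) and the density correction (8) admit a compact implementation, sketched below using the standard tanh change-of-variables identity, where the Jacobian contributes a $\log(1-\tanh^{2})$ term per dimension; the small numerical epsilon is an implementation detail, not part of the derivation.

```python
# A sketch of sampling via (7) and evaluating log hbar(x|theta) via (8).
import torch

def sample_design(mu, sigma):
    xi = torch.randn_like(mu)                 # xi ~ N(0, I)
    pre_tanh = mu + sigma * xi
    x = torch.tanh(pre_tanh)                  # design in (-1, 1), eq. (7)
    # log hbar(x|theta) = Gaussian log-density - log|det dg/dxi|, eq. (8)
    gauss_logp = torch.distributions.Normal(mu, sigma).log_prob(pre_tanh).sum(-1)
    log_det = torch.log(1.0 - x.pow(2) + 1e-6).sum(-1)
    return x, gauss_logp - log_det
```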

The sampling distribution will be updated to reach a higher expected score

$$J(\theta)=\mathbb{E}_{x\sim\hbar(\cdot|\theta)}\,Q^{\phi}(x)+\alpha H\!\left(\hbar(\cdot|\theta)\right), \tag{9}$$

where $H(\hbar(\cdot|\theta))$ is the entropy regularization term for balancing exploitation and exploration, and can be calculated by

$$H\!\left(\hbar(\cdot|\theta)\right)=\mathbb{E}_{x\sim\hbar(\cdot|\theta)}\!\left[-\log\hbar(x|\theta)\right]. \tag{10}$$

Therefore, $J(\theta)$ can be rewritten as

$$\begin{aligned}
J(\theta)&=\mathbb{E}_{x\sim\hbar(\cdot|\theta)}\!\left[Q^{\phi}(x)-\alpha\log\hbar(x|\theta)\right]\\
&=\mathbb{E}_{\xi\sim\mathcal{N}(0,I)}\!\left[Q^{\phi}(g^{\theta}(\xi))-\alpha\log\hbar(g^{\theta}(\xi)|\theta)\right]
\end{aligned} \tag{11}$$

by substituting (10) into (9), and its gradient with respect to $\theta$ can be derived as:

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\xi\sim\mathcal{N}(0,I)}\!\left[\nabla_{g^{\theta}(\xi)}\!\left(Q^{\phi}(g^{\theta}(\xi))-\alpha\log\hbar(g^{\theta}(\xi)|\theta)\right)\nabla_{\theta}g^{\theta}(\xi)\right], \tag{12}$$

which can be estimated with a batch of samples $\mathcal{B}_{a}$ drawn from $\mathcal{N}(0,I)$:

$$\nabla_{\theta}J(\theta)=\frac{1}{N}\sum_{\xi\in\mathcal{B}_{a}}\left[\nabla_{g^{\theta}(\xi)}\!\left(Q^{\phi}(g^{\theta}(\xi))-\alpha\log\hbar(g^{\theta}(\xi)|\theta)\right)\nabla_{\theta}g^{\theta}(\xi)\right], \tag{13}$$

where $N$ is the size of $\mathcal{B}_{a}$. Finally, $\theta$ can be updated with (13) through one step of gradient ascent.

We summarize the optimization process in Algorithm 1. In each iteration, only a single design is evaluated on the simulation model, and the data pair is stored for future use. In the end, the best design in the replay buffer is chosen as the final result.

Algorithm 1 Optimization with continuous design space
1: for $episode = 1:M$ do
2:     The actor generates a design
$$x=g^{\theta}(\xi)=\tanh\!\left(\mu^{\theta}+\sigma^{\theta}\odot\xi\right),\quad \xi\sim\mathcal{N}(0,I). \tag{14}$$
3:     The simulation model gives the score $y=f(x)$.
4:     Store the $(x,y)$ pair in $\mathcal{D}$.
5:     for $round = 1:K$ do
6:         Sample a mini-batch $\mathcal{B}_{q}$ from $\mathcal{D}$.
7:         Update the critic with the gradient estimated by (6).
8:     for $round = 1:K$ do
9:         Sample a mini-batch $\mathcal{B}_{a}$ from $\mathcal{N}(0,I)$.
10:        Update the actor with the gradient estimated by (13).
11: return the best design stored in $\mathcal{D}$.
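A hedged end-to-end sketch of Algorithm 1 is given below. It reuses the `ContinuousActor`, `Critic`, `update_critic` and `sample_design` sketches introduced earlier; the simulator `f`, learning rates, batch sizes, and the values of $M$ and $K$ are illustrative assumptions, not the settings used in the experiments.

```python
# A sketch of Algorithm 1 (continuous design space), under the stated assumptions.
import torch

def run_algorithm1(f, noise_dim, design_dim, M=1000, K=10, alpha=1e-2):
    actor, critic = ContinuousActor(noise_dim, design_dim), Critic(design_dim)
    a_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
    c_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
    buffer = []
    for _ in range(M):
        with torch.no_grad():                              # line 2: generate one design
            mu, sigma = actor(torch.randn(1, noise_dim))
            x, _ = sample_design(mu, sigma)
        y = f(x.squeeze(0).numpy())                        # line 3: query the simulator
        buffer.append((x.squeeze(0).numpy(), y))           # line 4: store the pair
        for _ in range(K):                                 # lines 5-7: critic updates
            update_critic(critic, c_opt, buffer)
        for _ in range(K):                                 # lines 8-10: actor updates
            mu, sigma = actor(torch.randn(64, noise_dim))
            x_batch, logp = sample_design(mu, sigma)
            J = (critic(x_batch) - alpha * logp).mean()    # objective (11)
            a_opt.zero_grad(); (-J).backward(); a_opt.step()
    return max(buffer, key=lambda p: p[1])                 # line 11: best stored design
```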

III-B Discrete Design Space

Here we consider the case where there are only finitely many feasible designs in $\mathcal{X}$. In this case, the network structure of the actor is slightly modified to output a discrete distribution over the design space. As illustrated in Fig. 1(b), the number of output channels equals the size of $\mathcal{X}$, and each channel corresponds to a feasible design. After applying the softmax function, a distribution over $\mathcal{X}$ is obtained:

$$P^{\theta}(x_{i})=\frac{\exp\!\left(\mu^{\theta}(x_{i})\right)}{\sum_{x_{j}\in\mathcal{X}}\exp\!\left(\mu^{\theta}(x_{j})\right)},\quad \forall x_{i}\in\mathcal{X}, \tag{15}$$

where, by a slight abuse of notation, $\mu^{\theta}(x_{i})$ here denotes the output of the channel corresponding to $x_{i}$. Then we update the actor by maximizing the expected score:

$$\begin{aligned}
J(\theta)&=\mathbb{E}_{x\sim P^{\theta}}\,Q^{\phi}(x)+\alpha H\!\left(P^{\theta}(\cdot)\right)\\
&=\sum_{x\in\mathcal{X}}P^{\theta}(x)\left[Q^{\phi}(x)-\alpha\log P^{\theta}(x)\right],
\end{aligned} \tag{16}$$

where the entropy regularization is also introduced as in Section III-A. The gradient of $J$ with respect to $\theta$ can then be calculated by:

$$\nabla_{\theta}J(\theta)=\sum_{x\in\mathcal{X}}\nabla_{\theta}P^{\theta}(x)\left[Q^{\phi}(x)-\alpha\log P^{\theta}(x)-\alpha\right], \tag{17}$$

and the actor can be updated with one step of stochastic gradient ascent.
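The discrete actor update (15)-(17) can be sketched as follows, letting automatic differentiation supply the gradient (17); the names `logits`, `q_values`, `actor` and `optimizer` in the usage comment are illustrative assumptions.

```python
# A sketch of the entropy-regularized discrete actor objective (16).
import torch
import torch.nn.functional as F

def discrete_actor_objective(logits, q_values, alpha):
    p = F.softmax(logits, dim=-1)                     # P^theta over X, eq. (15)
    log_p = F.log_softmax(logits, dim=-1)
    return (p * (q_values - alpha * log_p)).sum()     # J(theta), eq. (16)

# One ascent step (assuming `actor()` returns logits and `optimizer` wraps its parameters):
# J = discrete_actor_objective(actor(), critic_scores, alpha)
# optimizer.zero_grad(); (-J).backward(); optimizer.step()
```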

Algorithm 2 Optimization with discrete design space
1: for $episode = 1:M$ do
2:     The actor generates a design $x\sim P^{\theta}(\cdot)$.
3:     The simulation model gives the score $y=f(x)$.
4:     Store the $(x,y)$ pair in $\mathcal{D}$.
5:     for $round = 1:K$ do
6:         Sample a mini-batch $\mathcal{B}_{q}$ from $\mathcal{D}$.
7:         Update the critic with the gradient estimated by (6):
$$\nabla_{\phi}L(\phi)=\sum_{(x,y)\in\mathcal{B}_{q}}\left(Q^{\phi}(x)-y\right)\nabla_{\phi}Q^{\phi}(x). \tag{18}$$
8:     for $round = 1:K$ do
9:         The actor generates the distribution $P^{\theta}(\cdot)$.
10:        The critic predicts the score $Q^{\phi}(\cdot)$ for each design.
11:        Update the actor with the gradient estimated by (17).
12: return the best design stored in $\mathcal{D}$.

We summarize the optimization process in Algorithm 2. It is worth noting that in each update of the actor, we attempt to maximize the objective function (16), whose optimal solution is the energy-based policy ([10]):

$$P^{*}(x_{i})=\frac{\exp\!\left(Q^{\phi}(x_{i})/\alpha\right)}{\sum_{x_{j}\in\mathcal{X}}\exp\!\left(Q^{\phi}(x_{j})/\alpha\right)},\quad \forall x_{i}\in\mathcal{X}. \tag{19}$$

A simple proof of (19) can be given by constructing the following optimization problem:

$$\begin{aligned}
\operatorname*{argmin}_{P(x)}\quad & -\sum_{x\in\mathcal{X}}P(x)\left[Q^{\phi}(x)-\alpha\log P(x)\right]\\
\text{s.t.}\quad & -P(x)<0,\ \forall x\in\mathcal{X}\\
& \sum_{x\in\mathcal{X}}P(x)-1=0.
\end{aligned} \tag{20}$$

It is easy to verify that (20) is a convex problem and satisfies the Slater condition ([1]), which guarantees that $P^{*}(x)$ is the unique solution of the KKT conditions:

$$\begin{aligned}
P^{*}(x)&>0, &&\forall x\in\mathcal{X} &&(21a)\\
\sum_{x\in\mathcal{X}}P^{*}(x)&=1 && &&(21b)\\
\lambda^{*}(x)&\geq 0, &&\forall x\in\mathcal{X} &&(21c)\\
\lambda^{*}(x)P^{*}(x)&=0, &&\forall x\in\mathcal{X} &&(21d)\\
-Q^{\phi}(x)+\alpha\log P^{*}(x)+\alpha-\lambda^{*}(x)+\nu^{*}&=0, &&\forall x\in\mathcal{X}, &&(21e)
\end{aligned}$$

where $\lambda^{*}$ and $\nu^{*}$ are the optimal values of the dual variables. Combining (21a), (21c) and (21d), we get $\lambda^{*}(x)=0,\ \forall x\in\mathcal{X}$. Then, substituting this into (21e), we get

$$\log P^{*}(x)=\frac{Q^{\phi}(x)}{\alpha}+c,\quad \forall x\in\mathcal{X}, \tag{22}$$

where $c=-1-\frac{\nu^{*}}{\alpha}$. Substituting (22) into (21b), we have

$$c=-\log\sum_{x\in\mathcal{X}}\exp\!\left(\frac{Q^{\phi}(x)}{\alpha}\right). \tag{23}$$

Finally, combining (22) and (23), the optimal solution (19) is obtained. In Section IV, we will show that the learned distribution is very close to the optimal distribution $P^{*}$.
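For reference, (19) is simply a softmax of the predicted scores scaled by $1/\alpha$, as the following small sketch shows; `q_values` stands for the critic's predictions over the finite design set.

```python
# A small sketch of the closed-form optimal distribution (19).
import torch
import torch.nn.functional as F

def optimal_distribution(q_values, alpha):
    return F.softmax(q_values / alpha, dim=-1)        # P* in (19)

# Plugging P* back into (16) gives J = alpha * logsumexp(Q/alpha), the maximal value.
```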

So far, we have introduced methods to solve the simulation-based optimization problem (2) within the AC framework, and proposed algorithms for continuous and discrete design spaces respectively. In the next section, we design experiments to validate them.

IV APPLICATIONS

In this section, we will start from a toy example, with which some intuitive interpretation of Algorithms 1 and 2 will be given. Then we design more complex tasks, i.e., adversarial attack and reinforcement learning, to show the effectiveness of the proposed algorithms in large-scale systems.

Firstly, we choose the probability density function of a Gaussian Mixture Model (GMM) as the toy example:

$$\begin{aligned}
f(x)=\;& w_{1}\times\frac{1}{\sqrt{2\pi}\sigma_{1}}\exp\!\left(-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}\right)\\
&+w_{2}\times\frac{1}{\sqrt{2\pi}\sigma_{2}}\exp\!\left(-\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\right),\quad x\in(-1,1),
\end{aligned} \tag{24}$$

and the parameters are set as $w_{1}=0.51$, $w_{2}=0.49$, $\mu_{1}=-0.7$, $\mu_{2}=0.7$, $\sigma_{1}=0.6$, $\sigma_{2}=0.6$.
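For reproducibility of the toy example, the objective (24) can be written as a black-box callable as sketched below, with the parameter values listed above.

```python
# The toy objective (24) as a black-box callable (parameters as stated above).
import numpy as np

def f_gmm(x, w=(0.51, 0.49), mu=(-0.7, 0.7), sigma=(0.6, 0.6)):
    x = np.asarray(x, dtype=float)
    return sum(wk / (np.sqrt(2 * np.pi) * sk) * np.exp(-(x - mk) ** 2 / (2 * sk ** 2))
               for wk, mk, sk in zip(w, mu, sigma))
```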

Figure 2: Results of Algorithm 1 for $\alpha=10^{-1}$ (Fig. 2(a)), $\alpha=10^{-2}$ (Fig. 2(b)) and $\alpha=10^{-3}$ (Fig. 2(c)). In each figure, the blue line is the ground truth of the simulation model $f(\cdot)$, the orange line is the value predicted by the critic $Q^{\phi}(\cdot)$, and the green line is the sampling distribution $\hbar(\cdot|\theta)$ generated by the actor. The curves are normalized to $[0,1]$ for convenience of comparison. In order to show the effect of the entropy regularization, we replace the stochastic actor with a deterministic one, and the result is shown in Fig. 2(d).

We launch Algorithm 1 and the results are shown in Fig. 2. It can be found that the sampling distribution gradually converges to the optimal area as $\alpha$ decreases, which means that by setting an appropriate $\alpha$ to control the strength of the entropy regularization, satisfying designs can be drawn from the final actor. To analyze the effect of entropy regularization, $\alpha$ is set to $10^{-1}$, $10^{-2}$ and $10^{-3}$ respectively. Compared with the smaller $\alpha$ in Fig. 2(b) and Fig. 2(c), the actor in Fig. 2(a) generates an almost uniform distribution, which means that the entropy regularization has a greater impact on the objective in (9). Besides, since the samples are distributed more evenly over the allowed set, the predicted scores are close to the ground truth almost everywhere. As $\alpha$ decreases, the sampling distribution narrows toward the optimal area (Fig. 2(b) and Fig. 2(c)), which means that the actor can generate satisfying designs. However, the samples concentrate in some areas, which leads to worse performance of the critic in other areas, as illustrated in Fig. 2(c). We also test the performance of a deterministic actor, as in DDPG ([5]), where the actor only generates a deterministic design rather than a distribution over the design space. The final design falls into a local optimum more easily, as illustrated in Fig. 2(d). In real implementations, we usually set a larger $\alpha$ in the initial phase and decrease it step by step. In this way, the sampling distribution can converge to a neighborhood of the global optimum while avoiding local optima.

Figure 3: Results of Algorithm 2 for $\alpha=10^{-1}$ (Fig. 3(a)), $\alpha=10^{-2}$ (Fig. 3(b)) and $\alpha=10^{-3}$ (Fig. 3(c) and Fig. 3(d)). As in Fig. 2, the blue line is the ground truth of the simulation model $f(\cdot)$ and the orange line is the value predicted by the critic $Q^{\phi}(\cdot)$. Histograms illustrate the sampling distribution over the discrete $\mathcal{X}$: the blue bars denote the sampling distribution $P^{\theta}(\cdot)$ generated by the actor and the orange bars denote the optimal distribution $P^{*}(\cdot)$.

The simulation model (24) is modified to test Algorithm 2 by converting the design space $\mathcal{X}$ into a discrete set. The results are shown in Fig. 3. It can be found that by decreasing the strength of the entropy regularization, the sampling distribution gradually converges to the optimal area. We also compare the effect of the entropy regularization by setting $\alpha$ to $10^{-1}$, $10^{-2}$ and $10^{-3}$. When a larger $\alpha$ is chosen, e.g., $\alpha=10^{-1}$ in Fig. 3(a), the actor generates a nearly uniform distribution. When it is decreased, the learned distribution gradually converges to designs with higher scores, as in Fig. 3(b) and Fig. 3(c). However, if $\alpha$ is set too small, the actor may fall into a local optimum, as illustrated in Fig. 3(d). Therefore, in real implementations we also set a larger $\alpha$ in the initial phase and decrease it step by step, as in the continuous case. Furthermore, it can be found that the distribution $P^{\theta}(\cdot)$ generated by the actor is very close to the optimal one $P^{*}(\cdot)$, which validates the theoretical analysis.

After the toy examples, we design more complex tasks to validate the proposed algorithms.

The first task is abstracted from adversarial attack tasks, in which we need to generate samples to "cheat" NNs that are trained to recognize the objects in input pictures. In this experiment, we train a classifier to recognize handwritten digits on the public MNIST data set. The classifier $D(\cdot)$ takes an image $x'$ as input and generates a distribution over $Y=\{0,1,2,\cdots,9\}$, which denotes the confidence of classifying the digit image as the corresponding value. For convenience of later representation, the distribution is denoted as $D(x')\in R^{10}$, and $D(x',i),\ i\in Y$ denotes the confidence of classifying the image $x'\in R^{W\times H}$ as $i$, where, by a slight abuse of notation, $W$ and $H$ are the width and height of the images. In adversarial attack tasks, such fake images are generated to enhance the data set and train a more robust classifier.

Figure 4: Learning results. (a) Learning curve of the training process, drawn by averaging the confidences on 300 samples generated by the actor after each episode. (b) Image generated by the actor after 1000 episodes.

In order to generate the fake images, we construct the following simulation model:

$$f(x)=D\!\left(r\!\left(\frac{x+1}{2}\right),\,i\right), \tag{25}$$

where $x\in R^{W\times H}$ and $r(\cdot)$ is the resizing operator converting the input vector to the appropriate shape. The linear mapping $\frac{x+1}{2}$ converts the value range from $(-1,1)$ to $(0,1)$, the same as the images in the data set. Then, by solving the optimization problem (2), the actor can generate fake images that mislead the classifier into recognizing them as $i$. We choose $i=1$ in our experiments. The learning curve is illustrated in Fig. 4(a), where we can find that within 1000 episodes (only one sample is evaluated by $f(\cdot)$ in each episode), the average confidence reaches over 0.95. Before the 500-th episode, the average confidence is close to 0, which may be caused by the low accuracy of the critic. After the 500-th episode, the score increases gradually, which means both the actor and the critic are improved. One of the images generated by the final actor is shown in Fig. 4(b). It is just noise from the viewpoint of a human, but the classifier tends to predict it as the digit "1".
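A sketch of how (25) might be wrapped as a black-box objective is given below; the `classifier` is assumed to be a trained torch model returning class probabilities, and `W`, `H` and the target class are illustrative assumptions.

```python
# A sketch of the attack objective (25), under the stated assumptions.
import torch

def make_attack_objective(classifier, target=1, W=28, H=28):
    def f(x):                                     # x is a flat vector in (-1, 1)
        img = (torch.as_tensor(x, dtype=torch.float32) + 1.0) / 2.0
        img = img.reshape(1, 1, W, H)             # the resize operator r(.)
        with torch.no_grad():
            probs = classifier(img)               # D(.) over the 10 digit classes
        return probs[0, target].item()            # confidence D(., i)
    return f
```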

Another type of fake image is also of interest: images that retain the features of the initial digit but mislead the classifier into making wrong decisions through the addition of special noise. In order to generate these images, we construct the following simulation model:

$$f(x)=D\!\left(c\!\left(r\!\left(\frac{x+1}{2}\right)\times\delta+x_{j}\right),\,i\right),\quad i\neq j, \tag{26}$$

where the operator $c$ clips the input variables to $[0,1]$, and $\delta$ is a small constant controlling the noise level, which is set to 0.2 in our experiment. Besides, $x_{j}$ is an image from the initial data set whose label is $j$. In this case, the simulation-based optimization problem (2) aims to find special noise whose addition makes the classifier wrongly classify the image from $j$ to $i$.

Figure 5: Result of adding special noise to the initial image. The orange curve is the confidence that the classifier recognizes the noisy image as "6", and the blue curve is the confidence of recognizing it as "1". The three images in the lower row are the initial image mixed with the noise generated at episodes 0, 1500 and 3000, respectively.

The result is shown in Fig. 5. After about 2000 episodes, the actor can generate noise "special" enough to mislead the classifier. We present the noisy images at episodes 0, 1500 and 3000. All of these images retain the features of "6" from the viewpoint of a human, but the classifier is "cheated" into making wrong decisions.

Figure 6: CartPole-v0 environment (Fig. 6(a)) and the training result (Fig. 6(b)). The CartPole-v0 environment takes 0 or 1 as action at each time step, which drives the cart toward the left or right respectively. At most 200 steps are executed in each episode, and an early stop is triggered if the pole falls down. Each step is rewarded by +1. Fig. 6(b) shows the learning curve of Algorithm 1, drawn by averaging the return over 300 independent experiments.

Finally, we apply Algorithm 1 to a more complex task, i.e., an RL task. In typical policy-based and AC-based RL algorithms, the policy is parameterized by $\zeta$ and generates an action for each state. In existing RL methods, the parameter $\zeta$ is adjusted by Policy Gradient (PG) algorithms, as in TRPO ([11]) and PPO ([12]), or by Deterministic Policy Gradient (DPG) algorithms, as in DDPG ([5]) and TD3 ([6]), to reach a higher accumulated reward. In this work, we instead treat the task directly as a simulation model, where $\zeta$ is a concrete design. We take the "CartPole-v0" environment ([13]) in Fig. 6(a) as an example and deploy Algorithm 1 on it. The result is shown in Fig. 6(b), where we can find that the policy (the generated design) is successfully reinforced to reach a higher score. Although existing RL methods achieve better performance on such tasks, our method offers a new perspective on robot control, which may help with the sparse and delayed reward problem widely present in RL.
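As a sketch of this viewpoint, the following treats CartPole-v0 as the simulation model $f$: the design $\zeta$ parameterizes a simple policy and $f(\zeta)$ is the episode return. It assumes the classic Gym API (pre-0.26), and the linear policy form is an illustrative choice, not the parameterization used in the experiments.

```python
# A hedged sketch of f(zeta) for CartPole-v0: return of a linear policy.
import gym
import numpy as np

def cartpole_return(zeta, episodes=1):
    env = gym.make("CartPole-v0")
    w = np.asarray(zeta, dtype=float).reshape(4)   # one weight per state dimension
    total = 0.0
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            action = int(np.dot(w, obs) > 0)       # 0: push left, 1: push right
            obs, reward, done, _ = env.step(action)
            total += reward
    env.close()
    return total / episodes                        # f(zeta): average return
```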

We design these examples to show the applications of the proposed methods, as well as to validate their performance on large-scale problems. The results are further summarized in the next section.

V CONCLUSIONS

In this work, we consider the simulation-based optimization problem of selecting the best design with a computationally expensive evaluation model. The sampling process is bridged to a policy optimization problem and solved under the AC framework, where the critic serves as the surrogate model predicting the scores of untested designs, and the actor encodes the sampling distribution. We propose algorithms to update the actor and critic for continuous and discrete design spaces respectively, and design experiments to validate their effectiveness as well as illustrate their applications. The results show that our methods can successfully find satisfying designs and, to a certain extent, avoid falling into local optima. We note that in the experiment on the RL task, we offer a new perspective on robot control: optimizing the policy-generating process rather than the policy itself. This may help address the problem of sparse and delayed rewards in existing works, and this perspective will be studied in future work. Another direction for future research is adjusting the hyper-parameter $\alpha$ automatically, as in [14].

References

  • [1] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex optimization.   Cambridge university press, 2004.
  • [2] M. Pelikan, D. E. Goldberg, E. Cantú-Paz, et al., "BOA: The Bayesian optimization algorithm," in Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, vol. 1.   Citeseer, 1999, pp. 525–532.
  • [3] R. M. Dudley, "Sample functions of the Gaussian process," Selected Works of RM Dudley, pp. 187–224, 2010.
  • [4] Y. Peng, E. K. Chong, C.-H. Chen, and M. C. Fu, “Ranking and selection as stochastic control,” IEEE Transactions on Automatic Control, vol. 63, no. 8, pp. 2359–2373, 2018.
  • [5] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [6] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1587–1596.
  • [7] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1861–1870.
  • [8] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37, 1966.
  • [9] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [10] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energy-based policies,” in International Conference on Machine Learning.   PMLR, 2017, pp. 1352–1361.
  • [11] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning.   PMLR, 2015, pp. 1889–1897.
  • [12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [13] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," 2016.
  • [14] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al., “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.