
Introduction to Quantum Reinforcement Learning: Theory and PennyLane-based Implementation

Yunseok Kwak, Won Joon Yun, Soyi Jung, Jong-Kook Kim, and Joongheon Kim
School of Electrical Engineering, Korea University, Seoul, Republic of Korea
School of Software, Hallym University, Chuncheon, Republic of Korea
E-mails: [email protected], [email protected], [email protected],
[email protected], [email protected]
Abstract

The emergence of quantum computing enables researchers to apply quantum circuits to many existing studies. Utilizing quantum circuits and quantum differentiable programming, much research has been conducted, including Quantum Machine Learning (QML). In particular, quantum reinforcement learning is a good field in which to test the potential of quantum machine learning, and a lot of research is being done. This work introduces the concept of quantum reinforcement learning using a variational quantum circuit and confirms its feasibility through implementation and experimentation. We first present the background knowledge and working principle of quantum reinforcement learning, and then guide the implementation using the PennyLane library. We also discuss the power and potential of quantum reinforcement learning based on the experimental results obtained in this work.

I Introduction

Deep reinforcement learning has opened a new chapter in reinforcement learning by leveraging the power of artificial neural networks. It has led to achievements such as surpassing human performance in complex games like chess and Go, and it has already become an essential thread of artificial intelligence (AI) research.

On the other hand, quantum computing technology has already advanced to a level close to realizing the quantum computational gains predicted by many algorithmic studies [1, 2, 3]. In addition, the advent of the variational quantum circuit (VQC), which mimics the principles and functions of artificial neural networks, has made it possible to apply quantum computation to existing machine learning algorithms. VQC-based approaches have established themselves as a major trend in quantum machine learning research, and many studies using them are being actively conducted. In this context, many studies use VQCs as quantum neural networks (QNNs) [4, 5, 6, 7], including variational classifiers, image preprocessing, federated learning, and reinforcement learning. Among them, this paper introduces and discusses quantum reinforcement learning, a reinforcement learning model that replaces the artificial neural network of a deep Q-network (DQN) with a VQC.

Since quantum processing units (QPUs) have not yet been commercialized, current QML implementations cannot match the speed of existing machine learning frameworks running on dedicated accelerators such as NPUs. However, as development libraries such as TensorFlow Quantum [8], Qiskit [9], and PennyLane [10] for future quantum computing environments, and quantum computing clouds such as IBMQ, IonQ, and Amazon Braket, are provided to developers, various studies on QML are in progress. In particular, PennyLane [10] is a suitable library for starting quantum machine learning research because it provides a simulator that allows users to easily implement quantum circuits, using a CPU to emulate QPU operations. Therefore, we aim to increase access to quantum reinforcement learning and facilitate subsequent research by briefly introducing the implementation process with PennyLane. In addition, we discuss the impact and potential of quantum computing in reinforcement learning through experimentation and evaluation of a quantum reinforcement learning model in the CartPole environment provided by OpenAI.

II Backgrounds

II-A Reinforcement Learning

Reinforcement learning is mathematically modeled as a Markov Decision Process (MDP), a tuple $(\mathcal{S},\mathcal{A},P,R,T)$, where $\mathcal{S}$ is a finite set of states and $\mathcal{A}$ is a finite set of actions. The function $P:\mathcal{S}\times\mathcal{A}\to P(\mathcal{S})$ is the transition probability function, with $P(s^{\prime}\mid s,a)$ being the probability of transitioning into state $s^{\prime}$ when the agent executes action $a$ in state $s$. The function $R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to\mathbb{R}$ denotes the reward function, with $R_{t}=R(s_{t},a_{t},s_{t+1})$. The MDP has a finite time horizon $T$, and solving an MDP means finding an optimal policy $\pi_{\theta}^{*}\in\Pi:\mathcal{S}\times\mathcal{A}\to\left[0,1\right]$, where $\pi_{\theta}$ is a neural network-based policy with parameters $\theta$; observing $s$, $\pi_{\theta}$ determines the agent's action $a\in\mathcal{A}$ so as to maximize the cumulative reward received during the finite time horizon $T$.

When the environment transitions and the policy are stochastic, the probability of a $T$-step trajectory is defined as $P(\tau\mid\pi_{\theta})=\rho(s_{0})\prod_{t=0}^{T-1}P(s_{t+1}\mid s_{t},a_{t})\,\pi_{\theta}(a_{t}\mid s_{t})$, where $\rho$ is the initial state distribution. Then, the expected return $\mathcal{J}(\pi_{\theta})$ is defined as $\mathcal{J}(\pi_{\theta})=\int_{\tau}P(\tau\mid\pi_{\theta})R(\tau)\,d\tau=\mathbb{E}_{\tau\sim\pi_{\theta}}\left[R(\tau)\right]$, where the trajectory $\tau$ is a sequence of states and actions in the environment. The objective of reinforcement learning is to learn a policy that maximizes the expected return $\mathcal{J}(\pi_{\theta})$ when the agent acts according to the policy $\pi_{\theta}$. Therefore, the optimization objective is expressed by

$\pi_{\theta}^{*}=\arg\max_{\theta}\mathcal{J}(\pi_{\theta})$ (1)

with $\pi_{\theta}^{*}$ being the optimal policy.

Deep Q-Network (DQN) [11]. One of the conventional methods for solving an MDP is Q-learning, which uses a Q-table to find the optimal policy. However, Q-learning is limited to problems whose state space is small. Inspired by Q-learning, the deep Q-network (DQN), a model-free reinforcement learning method, was proposed to learn the optimal policy in high-dimensional state spaces. Experience replay $\mathcal{D}$ and a target network are the two key features used to stabilize the training of the deep neural network. Experiences $e_{t}=(s_{t},a_{t},R_{t+1},s_{t+1})$ of the agent are stored in the experience buffer $\mathcal{D}=(e_{1},e_{2},\dots,e_{T})$ and are periodically resampled to train the Q-network. Sampled experiences are used to update the parameters $\theta_{i}$ of the policy at the $i$-th training iteration with the loss function

$L(\theta_{i})=\mathbb{E}\left[\left(R_{t+1}+\gamma\max_{a^{\prime}}Q(s_{t+1},a^{\prime};\theta_{i}^{-})-Q(s_{t},a_{t};\theta_{i})\right)^{2}\right]$ (2)

where $\theta^{-}_{i}$ are the target network parameters. The target network parameters $\theta^{-}_{i}$ are updated with the Q-network parameters $\theta$ every predefined number of steps. Stochastic gradient descent is used to minimize the loss function.
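To make this update step concrete, the following is a minimal PyTorch sketch of the replay-and-update procedure in (2). It is an illustrative sketch rather than the implementation used in this paper: the small two-layer Q-networks, the buffer layout (s, a, R, s_next, done), and the hyperparameters are assumptions chosen for brevity.

# Minimal sketch of the DQN update in (2); network sizes, buffer layout, and
# hyperparameters are illustrative assumptions, not the paper's exact code.
import random
from collections import deque

import torch
import torch.nn as nn

gamma = 0.98                                 # discount factor (assumed)
buffer = deque(maxlen=10000)                 # experience replay D of (s, a, R, s_next, done)

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q( . ; theta)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q( . ; theta^-)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(batch_size=32):
    batch = random.sample(buffer, batch_size)                    # resample stored experiences
    s, a, R, s_next, done = map(torch.tensor, zip(*batch))
    s, s_next, R, done = s.float(), s_next.float(), R.float(), done.float()
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)  # Q(s_t, a_t; theta_i)
    with torch.no_grad():                                        # target uses frozen theta_i^-
        y = R + gamma * target_net(s_next).max(1).values * (1 - done)
    loss = ((y - q_sa) ** 2).mean()                              # squared TD error, as in (2)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

In practice the target network would be synchronized with the Q-network every fixed number of updates, as noted above.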

Proximal Policy Optimization (PPO) [12]. PPO is one of the breakthrough DRL algorithms for improving training stability: it ensures that updates to $\pi_{\theta}$ at every iteration remain small by clipping the probability ratio $r_{\pi}(\theta)=\pi_{\theta}(a\mid s)/\pi_{\theta_{\mathrm{old}}}(a\mid s)$, where $\theta_{\mathrm{old}}$ denotes the previously updated policy parameters. Schulman et al. [12] proposed a surrogate objective that prevents the new policy from straying too far from the old one, which is used to train the policy $\pi_{\theta}$. The clipped objective function is as follows:

$L^{\text{CLIP}}_{t}(\theta)=\min\left(r_{t}(\theta)A_{t},\;\mathrm{clip}(r_{t}(\theta),1-\epsilon,1+\epsilon)A_{t}\right)$, (3)

where $A_{t}$ is the estimated advantage function and the hyperparameter $\epsilon<1$ controls how far the new policy is allowed to move away from the old one. PPO uses stochastic gradient ascent to maximize the objective in (3).
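As a minimal illustration of (3), the following PyTorch sketch computes the clipped surrogate objective from precomputed log-probabilities and advantages; the tensor names and the default $\epsilon$ are assumptions for illustration, not part of the paper's code.

# Sketch of the clipped surrogate objective in (3); the inputs are assumed to
# be precomputed 1-D tensors collected from a rollout.
import torch

def clipped_surrogate(log_prob_new, log_prob_old, advantage, eps=0.1):
    ratio = torch.exp(log_prob_new - log_prob_old)              # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return torch.min(unclipped, clipped).mean()                 # maximize this quantity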

II-B Quantum Computing

Quantum computers use the qubit as the basic unit of computation. A qubit represents a quantum superposition of the two basis states $|0\rangle$ and $|1\rangle$ and is manipulated by unitary gates in a quantum circuit to perform various quantum operations. Its state can be represented as a normalized two-dimensional complex vector:

$|\psi\rangle=\alpha|0\rangle+\beta|1\rangle,\quad\mathrm{where}~|\alpha|^{2}+|\beta|^{2}=1$, (4)

and there is also a geometrical representation of the qubit state space, using polar coordinates $\theta$ and $\phi$:

$|\psi\rangle=\cos(\theta/2)|0\rangle+e^{i\phi}\sin(\theta/2)|1\rangle$, (5)

where $0\leq\theta\leq\pi$ and $0\leq\phi<2\pi$. The qubit state is thus mapped onto the surface of a three-dimensional unit sphere, called the Bloch sphere. A quantum gate is a unitary operator transforming one qubit state into another, which can be represented as a $2\times 2$ matrix with complex entries. Important single-qubit gates include the Pauli-$X$, Pauli-$Y$, and Pauli-$Z$ gates, which rotate the state by $\pi$ around their corresponding axes of the Bloch sphere. The rotation operator gates $R_{x}(\theta)$, $R_{y}(\theta)$, and $R_{z}(\theta)$ rotate by an angle $\theta$ instead of $\pi$ around the same axes, and it is known that any single-qubit unitary in $SU(2)$ can be written as a product of three such rotation operators, one per axis. In addition, there are quantum gates that operate on multiple qubits, called controlled gates. They act on a target qubit according to the state of one or more control qubits, which generates quantum entanglement between the qubits. Among them, the controlled-$X$ (or CNOT) gate is one of the most widely used, flipping the second qubit when the first qubit is $|1\rangle$. These gates allow quantum algorithms to exploit such features on the quantum circuits introduced later.
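As a small illustration of these gates, the following PennyLane sketch applies rotation operators to one qubit and a CNOT to entangle it with a second qubit, producing a Bell state; the specific angles are arbitrary choices for demonstration.

# Rotation gates and a CNOT in PennyLane; the angles are arbitrary examples.
import numpy as np
import pennylane as qml

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def two_qubit_circuit(theta, phi):
    qml.RY(theta, wires=0)          # rotate qubit 0 by theta around the Y axis
    qml.RZ(phi, wires=0)            # rotate qubit 0 by phi around the Z axis
    qml.CNOT(wires=[0, 1])          # flip qubit 1 when qubit 0 is |1>, creating entanglement
    return qml.state()

print(two_qubit_circuit(np.pi / 2, 0.0))   # approximately (|00> + |11>) / sqrt(2)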

II-C Variational Quantum Circuit

The variational quantum circuit (or parameterized quantum circuit) is a quantum circuit that uses learnable parameters to perform various numerical tasks, such as optimization, approximation, and classification. The operation of a general VQC model can be divided into four steps. The first is the state preparation step, in which the input information is encoded into corresponding qubit states that can be processed by the quantum circuit. The next is the variational step, which entangles qubit states with controlled gates and rotates qubits with parameterized rotation gates. This step can be repeated in a multi-layer manner with more parameters, which can enhance the performance of the circuit. In the third step, the processed qubit states are measured and decoded into the appropriate output form. The last step is conducted outside the circuit: the quantum circuit parameters are updated in the direction that optimizes the objective function of the algorithm by a classical optimizer, such as Adam, running on a CPU, and the circuit with the new parameters then performs the computation again from the beginning. Such a circuit is known to be able to approximate any continuous function, like a classical neural network [13], so the VQC is often called a quantum neural network (QNN) [14] and has been widely applied in quantum machine learning research.
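The four steps can be summarized with a toy, single-qubit PennyLane sketch; the circuit, the quadratic cost, and the target value below are illustrative assumptions, not part of the model used later in this paper.

# Toy illustration of the four VQC steps; circuit, cost, and target are assumed.
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev)
def vqc(params, x):
    qml.RY(x, wires=0)                    # 1. state preparation: encode the input x
    qml.RZ(params[0], wires=0)            # 2. variational step: parameterized rotations
    qml.RX(params[1], wires=0)
    return qml.expval(qml.PauliZ(0))      # 3. measurement: decode to a classical output

def cost(params):
    return (vqc(params, x=0.5) - 0.0) ** 2    # objective evaluated on a classical computer

opt = qml.AdamOptimizer(stepsize=0.1)         # 4. classical parameter update (Adam)
params = np.array([0.1, 0.2], requires_grad=True)
for _ in range(50):
    params = opt.step(cost, params)           # circuit is re-run with the new parameters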

Initialize replay memory $\mathcal{D}$ to capacity $N$
Initialize action-value function quantum circuit $Q$ with random parameters $\theta$
Initialize state-value function $V(s;\phi)$
for episode $=1,2,\ldots,M$ do
     Initialize state $s_{1}$ and encode it into a quantum state
     # 1. Inference Process #
     for $t=1,2,\ldots,T$ do
         With probability $\epsilon$ select a random action $a_{t}$
         otherwise select $a_{t}=\arg\max_{a}Q(s_{t},a;\theta)$ from the output of the quantum circuit
         Execute action $a_{t}$ in the emulator and observe reward $R_{t}$ and next state $s_{t+1}$
         Store transition $\left(s_{t},a_{t},R_{t},s_{t+1}\right)$ in $\mathcal{D}$
     end for
     # 2. Training Process #
     for $i=1,\ldots,K_{\mathrm{epoch}}$ do
         Sample a random mini-batch of transitions $\left(s_{j},a_{j},R_{j},s_{j+1}\right)$ from $\mathcal{D}$
         Calculate the temporal-difference target, $y_{j}=\begin{cases}R_{j}&\text{for terminal }s_{j+1}\\ R_{j}+\gamma\max_{a^{\prime}}Q(s_{j+1},a^{\prime};\theta)&\text{for non-terminal }s_{j+1}\end{cases}$
         Calculate the temporal difference, $\delta_{j}=y_{j}-V(s_{j})$
         Calculate the estimated advantage, $\hat{A}_{j}=\delta_{j}+(\gamma\lambda)\delta_{j+1}+\cdots+(\gamma\lambda)^{J-j-1}\delta_{J-1}$
         Calculate the ratio, $r_{j}=\frac{\pi_{\theta}(a_{j}|s_{j})}{\pi_{\theta_{\mathrm{old}}}(a_{j}|s_{j})}$
         Calculate the surrogate actor loss using (3)
         Calculate the critic loss, $|V(s_{j})-y_{j}|$
         Calculate gradients and update the actor and critic parameters
     end for
end for
Algorithm 1 Variational Quantum Deep Q-Learning with PPO

III Quantum Reinforcement Learning

III-A Variational Quantum Policy Circuit

Figure 1: A policy-VQC for deep Q-learning with parameters $\theta$.
Figure 2: Quantum Reinforcement Learning System

In recent studies of quantum reinforcement learning [15, 16], a VQC substitutes for the policy-training DNN of existing DRL. At each episode, the agent, given state information, determines its action from the policy-VQC, and the parameters are updated with a classical optimizer such as Adam. This paper takes a similar approach, using the VQC depicted in Fig. 1.
The quantum circuit in Fig. 1 is a prototype policy-VQC, consisting of the basic structure of a VQC. The state encoding part of the circuit uses $R_{y}$ gates parameterized by the normalized state input $s$, whose values lie between $-\pi$ and $\pi$. The variational part in the center consists of entangling $CX$ gates and $R_{x}$, $R_{y}$, and $R_{z}$ gates parameterized by the free parameters $\theta$. This part is called a layer, and several layers can be stacked in a circuit. After that, the measured output of the circuit is decoded into the action space, yielding the action probabilities. The obtained $\pi_{\theta}$ is then evaluated and updated on a classical computer.

III-B Quantum Reinforcement Learning Systems

The quantum reinforcement learning system in this paper is as described in Fig. 2. At the beginning of an episode, the quantum-classical hybrid agent receives state information from the environment and determines its action according to the policy $\pi_{\theta}$ produced by the VQC. The policy of the agent is then evaluated and updated by the PPO algorithm introduced above and described in Algorithm 1. The PPO algorithm is effectively the same as in the previous study [12], and the replay buffer functions in the same way as in traditional approaches, keeping track of the $\langle s,a,R,s^{\prime}\rangle$ tuples. In other words, one does not have to fundamentally or drastically change an algorithm in order to apply the power of VQCs to it. The full procedure is presented in Algorithm 1.

# Imports and device setup (added so the listing is self-contained)
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
import pennylane as qml

n_qubit = 4  # CartPole provides a 4-dimensional state
dev = qml.device('default.qubit', wires=n_qubit)

# Parameterized Rotation & Entanglement Layers
def layer(W):
    for i in range(n_qubit):
        qml.RX(W[i, 0], wires=i)
        qml.RY(W[i, 1], wires=i)
        qml.RZ(W[i, 2], wires=i)

# Classical Critic
class V(nn.Module):
    def __init__(self):
        super(V, self).__init__()
        self.fc1 = nn.Linear(4, 256)
        self.fc_v = nn.Linear(256, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        v = self.fc_v(x)
        return v

# Variational Quantum Policy Circuit (Actor)
@qml.qnode(dev, interface='torch')
def circuit(W, s):
    # W: Layer Variable Parameters, s: State Variable
    # Input Encoding
    for i in range(n_qubit):
        qml.RY(np.pi * s[i], wires=i)
    # Variational Quantum Circuit
    layer(W[0])
    for i in range(n_qubit - 1):
        qml.CNOT(wires=[i, i + 1])
    layer(W[1])
    for i in range(n_qubit - 1):
        qml.CNOT(wires=[i, i + 1])
    layer(W[2])
    for i in range(n_qubit - 1):
        qml.CNOT(wires=[i, i + 1])
    layer(W[3])
    qml.CNOT(wires=[0, 2])
    qml.CNOT(wires=[1, 3])
    return [qml.expval(qml.PauliY(ind)) for ind in range(2, 4)]

# Declare Quantum Circuit and Parameters
W = Variable(torch.DoubleTensor(np.random.rand(4, 4, 3)), requires_grad=True)
v = V()
circuit_pi = circuit
optimizer1 = optim.Adam([W], lr=1e-3)
optimizer2 = optim.Adam(v.parameters(), lr=1e-5)

Figure 3: Variational Quantum Policy Circuit with PennyLane
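To connect the listing in Fig. 3 with Algorithm 1, the following is a sketch of one training iteration: temporal-difference targets, estimated advantages, and the clipped actor/critic update. It reuses circuit_pi, W, v, optimizer1, and optimizer2 from Fig. 3, while the rollout tensors and the policy_probs helper are assumptions introduced here for illustration only.

# Sketch of one Algorithm-1 training iteration; the rollout tensors and the
# policy_probs helper are illustrative assumptions, not the paper's exact code.
import torch
import torch.nn.functional as F

gamma, lam, eps_clip = 0.98, 0.95, 0.01

def policy_probs(W, s):
    # decode the two Pauli-Y expectation values into two action probabilities
    return F.softmax(torch.stack([e for e in circuit_pi(W, s)]), dim=0)

def train_step(rollout):
    # rollout: tensors of states (N,4), actions (N,), rewards (N,),
    # next_states (N,4), dones (N,), and old action probabilities (N,)
    states, actions, rewards, next_states, dones, old_probs = rollout
    with torch.no_grad():
        td_target = rewards + gamma * v(next_states.float()).squeeze(-1) * (1 - dones)
        delta = td_target - v(states.float()).squeeze(-1)
    # estimated advantage, accumulated backwards in time with factor gamma*lambda
    advantages, running = [], 0.0
    for d in reversed(delta.tolist()):
        running = d + gamma * lam * running
        advantages.insert(0, running)
    advantages = torch.tensor(advantages, dtype=torch.float64)   # match the double-precision W
    # ratio and clipped surrogate actor loss as in (3), critic loss |V(s) - y|
    new_probs = torch.stack([policy_probs(W, s)[a] for s, a in zip(states, actions)])
    ratio = new_probs / old_probs
    actor_loss = -torch.min(ratio * advantages,
                            torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantages).mean()
    critic_loss = (v(states.float()).squeeze(-1) - td_target).abs().mean()
    optimizer1.zero_grad(); optimizer2.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer1.step(); optimizer2.step()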

IV The Implementation of QRL

IV-A Implementation Guidelines

As quantum computing and quantum machine learning research progresses actively, many development libraries for researchers have emerged, such as TensorFlow Quantum, Qiskit, and PennyLane. Among them, PennyLane was created to support quantum machine learning research, allowing anyone to easily test the performance of quantum circuits through quantum simulators. The quantum simulator supported by PennyLane allows a CPU to imitate the operation of a QPU and, in particular, supports parameters in the form of PyTorch tensors together with gradient computation [17]. Thanks to these features, PennyLane makes it easy for anyone who has previously done machine learning research with PyTorch to start quantum machine learning research. Based on this background, in this paper a quantum reinforcement learning model was implemented using PennyLane and PyTorch, as shown in Fig. 3.

IV-B The CartPole Environment

CartPole, the implementation environment in this paper, is a test environment for reinforcement learning provided by OpenAI [18]. It is a game in which the agent moves a cart back and forth to keep a pole on the cart from falling over; the longer the pole is held upright, the greater the reward. At every step, the agent observes the cart's position and velocity and the pole's angle and angular velocity, and decides in which direction to accelerate. The VQC in Fig. 1, using 4 qubits, is suitable for policy making in this environment. Each of the four pieces of information provided by the environment is normalized and fed into the circuit as a rotation angle between $-\pi$ and $\pi$, and the two measured expectation values are decoded into the probabilities of the two actions via the softmax function. This process continues until the agent can take the optimal action for the given state information by optimizing the parameters with the reinforcement learning algorithm. The experimental results are shown later in this paper.
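The following sketch illustrates this encoding and decoding step, reusing circuit_pi and W from Fig. 3; the per-dimension clipping bounds and the example observation are assumptions (CartPole's velocity components are unbounded, so finite ranges must be chosen), not values prescribed by the environment.

# Sketch of state normalization and softmax decoding; the clipping bounds and
# the example observation are illustrative assumptions.
import numpy as np
import torch
import torch.nn.functional as F

bounds = np.array([2.4, 2.5, 0.21, 2.5])     # assumed ranges: position, velocity, angle, angular velocity

def normalize(obs):
    # map each component into [-1, 1]; the circuit in Fig. 3 multiplies by pi,
    # so the RY encoding angles end up between -pi and pi
    return np.clip(obs / bounds, -1.0, 1.0)

obs = np.array([0.02, -0.3, 0.01, 0.4])      # example CartPole observation
s = torch.tensor(normalize(obs))
expvals = torch.stack([e for e in circuit_pi(W, s)])   # two Pauli-Y expectation values
probs = F.softmax(expvals, dim=0)                      # probabilities of the two actions
action = torch.multinomial(probs, 1).item()            # sample the action to execute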

IV-C Experimental Setup

Our experiments are conducted with the software packages PyTorch, for the speed and convenience of tensor operations, and PennyLane, for quantum circuit simulation. The quantum simulator provided by PennyLane is very convenient to use, but it is difficult to use many qubits because of its slow computational speed. Therefore, the CartPole environment was chosen as a simple environment that can be handled by a circuit with a small number of qubits. We used the Adam optimizer as the classical parameter optimizer, with a learning rate of 0.001 for the quantum policy and 0.00001 for the classical critic. The other hyperparameter settings are $\gamma=0.98$, $\lambda=0.95$, and $\epsilon=0.01$. The baseline model is a random version of this model, using random parameters at every time step without optimization.
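For reference, the experimental settings above can be collected in a single configuration dictionary (values as stated in this subsection):

# Hyperparameters used in the experiment, as listed above.
config = {
    "optimizer": "Adam",
    "lr_quantum_policy": 1e-3,      # learning rate for the VQC actor
    "lr_classical_critic": 1e-5,    # learning rate for the classical critic
    "gamma": 0.98,                  # discount factor
    "lambda": 0.95,                 # advantage estimation parameter
    "epsilon": 0.01,                # PPO clipping range
    "environment": "CartPole-v0",
    "n_qubit": 4,
    "baseline": "random parameters without optimization",
}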

IV-D Experimental Results

Figure 4: Comparison of Total Reward on Average in Environment (CartPole-v0)

Fig. 4 shows the performance of the proposed quantum reinforcement learning model. Compared with random actions, one can see that the model is learning to find the optimal action. It can also be seen that the deviation of rewards during the learning process is extremely high. This is due to uncertainty within quantum systems, although the impact of the reinforcement learning algorithm encouraging exploration cannot be ignored either. This uncertainty simultaneously implies the possibilities and limitations of quantum reinforcement learning: it allows effective policy exploration with only a few tens of parameters, but makes it difficult to maintain good results once they are reached. Leveraging these characteristics is an important challenge for quantum reinforcement learning.

V Conclusions and Future Work

Through this work, we implemented and tested a quantum reinforcement learning model in the CartPole environment based on PPO, one of the latest deep reinforcement learning techniques, and demonstrated implementation guidelines with the PennyLane library. We discussed the principle and potential of reinforcement learning using a variational quantum circuit. Furthermore, this work aims to help researchers who are new to quantum reinforcement learning begin their own research. Although the performance of quantum reinforcement learning cannot yet be said to be better than that of existing methods, it is expected that many follow-up studies will yield results that exceed the limitations of existing reinforcement learning.

Acknowledgment

This work was supported by the National Research Foundation of Korea (2019M3E4A1080391). Joongheon Kim is a corresponding author of this paper.

References

  • [1] J. Kim, Y. Kwak, S. Jung, and J.-H. Kim, “Quantum scheduling for millimeter-wave observation satellite constellation,” in Proceedings of the IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS), 2021, pp. 1–1.
  • [2] J. Choi and J. Kim, “A tutorial on quantum approximate optimization algorithm (QAOA): Fundamentals and applications,” in Proceedings of the IEEE International Conference on Information and Communication Technology Convergence (ICTC), 2019, pp. 138–142.
  • [3] J. Choi, S. Oh, and J. Kim, “The useful quantum computing techniques for artificial intelligence engineers,” in Proceedings of the IEEE International Conference on Information Networking (ICOIN), 2020, pp. 1–3.
  • [4] Y. Kwak, W. J. Yun, S. Jung, and J. Kim, “Quantum neural networks: Concepts, applications, and challenges,” CoRR, vol. abs/2108.01468, 2021.
  • [5] S. Oh, J. Choi, and J. Kim, “A tutorial on quantum convolutional neural networks (QCNN),” in Proceedings of the IEEE International Conference on Information and Communication Technology Convergence (ICTC), 2020, pp. 236–239.
  • [6] S. Oh, J. Choi, J.-K. Kim, and J. Kim, “Quantum convolutional neural network for resource-efficient image classification: A quantum random access memory (QRAM) approach,” in Proceedings of the IEEE International Conference on Information Networking (ICOIN), 2021, pp. 50–52.
  • [7] J. Choi, S. Oh, and J. Kim, “A tutorial on quantum graph recurrent neural network (QGRNN),” in Proceedings of the IEEE International Conference on Information Networking (ICOIN), 2021, pp. 46–49.
  • [8] M. Broughton, G. Verdon, T. McCourt, A. J. Martinez, J. H. Yoo, S. V. Isakov, P. Massey, M. Y. Niu, R. Halavati, E. Peters, M. Leib, A. Skolik, M. Streif, D. V. Dollen, J. R. McClean, S. Boixo, D. Bacon, A. K. Ho, H. Neven, and M. Mohseni, “TensorFlow Quantum: A software framework for quantum machine learning,” arXiv preprint, 2020.
  • [9] M. S. A. et al., “Qiskit: An open-source framework for quantum computing,” 2021.
  • [10] V. Bergholm, J. Izaac, M. Schuld, C. Gogolin, M. S. Alam, S. Ahmed, J. M. Arrazola, C. Blank, A. Delgado, S. Jahangiri, K. McKiernan, J. J. Meyer, Z. Niu, A. Száva, and N. Killoran, “PennyLane: Automatic differentiation of hybrid quantum-classical computations,” arXiv preprint, 2020.
  • [11] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv:1312.5602, 2013.
  • [12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv:1707.06347, 2017.
  • [13] J. Biamonte, “Universal variational quantum computation,” Physical Review A, vol. 103, no. 3, p. L030401, 2021.
  • [14] N. Wiebe, A. Kapoor, and K. M. Svore, “Quantum deep learning,” arXiv preprint arXiv:1412.3489, 2014.
  • [15] S. Y.-C. Chen, C.-H. H. Yang, J. Qi, P.-Y. Chen, X. Ma, and H.-S. Goan, “Variational quantum circuits for deep reinforcement learning,” IEEE Access, vol. 8, pp. 141 007–141 024, 2020.
  • [16] S. Jerbi, C. Gyurik, S. Marshall, H. J. Briegel, and V. Dunjko, “Variational quantum policies for reinforcement learning,” arXiv preprint arXiv:2103.05577, 2021.
  • [17] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8024–8035.
  • [18] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.