
Quadruped Locomotion on Non-Rigid Terrain using Reinforcement Learning

Taehei Kim1 and Sung-Hee Lee1 1Graduate School of Cultural Technology, KAIST, Daejeon, South Korea. {hayleyy321|sunghee.lee}@kaist.ac.kr
Abstract

Legged robots need to be capable of walking on diverse terrain. In this paper, we present a novel reinforcement learning framework for learning locomotion on non-rigid dynamic terrains. Specifically, our framework can generate quadruped locomotion on a flat elastic terrain that consists of a matrix of tiles moving up and down passively when pushed by the robot's feet. A trained robot with a 55 cm base length can walk on terrain that sinks by up to 5 cm. We propose a set of observation and reward terms that enable this locomotion; in particular, we find that including the end-effector history and end-effector velocity terms in the observation is crucial. We show the effectiveness of our method by training the robot under various terrain conditions.

I Introduction

Quadruped robots have many advantages such as stability and terrain adaptivity, and thus researchers have put consistent effort into enabling quadruped robot locomotion in various conditions and environments. Reinforcement learning is a powerful tool for this problem: it allows a simple, linear reward design, reduces the burden of sophisticated physics-based controller design, and finds control policies that are robust to changes in the environment. Recently, deep reinforcement learning (DRL)-based methods have shown significant advances in quadruped locomotion tasks, such as walking at faster speeds [1], recovery from falling [2], manipulation [3], and walking on diverse terrains [4] with faster convergence [5, 6].

Our work extends the range of environments for DRL-based quadruped locomotion: while the majority of previous studies assume rigid environments, we show the possibility of reinforcement learning frameworks learning locomotion on non-rigid dynamic terrains. As real-life environments include grounds that deform plastically or elastically under pushing forces, such as sand, shaky bridges, or trampolines, the ability to locomote on non-rigid terrain is important for quadruped robots. To the best of our knowledge, our work is the first DRL framework, albeit in a simulated environment, that learns to walk on non-rigid terrains. Figure 1 shows a snapshot of our result, in which a virtual Laikago robot walks on the non-rigid terrain.


Figure 1: The real world includes various types of non-rigid terrain. We propose a DRL-based method for quadruped locomotion on flat elastic terrain.

As our main contribution, we develop a framework that allows a quadruped robot to walk on a flat, non-rigid terrain, modeled as a tiled ground in which each tile sinks elastically with varying stiffness. Specifically, we propose a set of observation and reward terms that enable locomotion on the non-rigid terrain. Our experiments show that remembering the history of end-effector positions at certain time intervals, as well as the end-effector velocity, is important. The effectiveness and limitations of our framework are examined by comparing results obtained for different sets of training terrain environments.

II Related Work

Quadruped locomotion has been of great interest to researchers due to its potential utility. Starting from walking on flat rigid terrain, state-of-the-art methods aim to achieve more natural and agile movements in diverse environments. This section introduces studies that develop DRL-based methods for locomotion as well as studies that develop controllers for environments similar to ours.

II-A Walking on Rigid Flat Terrain

Since Google DeepMind demonstrated a DRL-based method for training virtual agents to locomote, DRL-based approaches to learning control policies for locomotion have been widely studied, e.g., [7]. [8] trains virtual quadrupeds to run in 2D terrains with gaps, and [9] develops a two-level planning framework for a biped agent to walk. Biped locomotion is also studied in [10] and [11].

An important problem is deploying a policy learned in a virtual environment on a real robot. One way to address this is to train a controller with randomized settings, such as added noise, to secure versatility. [12] successfully trains quadruped locomotion by randomizing the physical environment and adding perturbations. [13] randomizes the dynamics of the simulator to train a control policy for a fetching robot. Studies such as [14] and [15] show successful locomotion of the small physical robot Minitaur. [16] adopts model-based reinforcement learning for Minitaur locomotion to reduce data collection during training for faster learning. [15] focuses on minimizing human interference when training on a real physical robot. [1] transfers a policy learned in simulation to a physical robot using actuator modeling. [5] introduces constrained policy optimization for faster convergence. [17] applies animal behavior to real robots using an imitation learning approach, in which an agent follows reference motion data to achieve challenging tasks.

II-B Walking on Various Terrain Environments

Recent advancements in DRL-based methods extend locomotion to diverse environments. In simulated environments, [18] introduces locomotion of biped agents on stairs with diverse heights and tilts. For quadruped agents, [19] introduces walking through obstacle-filled or slippery environments using a mixture of imitation learning, generative adversarial networks, and reinforcement learning in a simulated environment. [4] introduces a method that combines the benefits of model-based planning and control with reinforcement learning to tackle environments with varying heights and gaps. [20] introduces a method based on modulated Bezier curve gaits that enables uneven terrain locomotion using only inertial measurements on a physical robot; its simulation environment adds a nominal clearance height, a virtual ground penetration depth, and residual foot displacements to the open-loop Bezier curve, which may not necessarily match the real physical parameters. [21] develops a teacher-student reinforcement learning framework that can create foot trajectories traversing multiple environments including water, mud, and rock-filled terrains. [3] goes one step further by introducing a physical quadruped robot that can juggle a rubber ball.

II-C Walking on Non-Rigid Terrain using Dynamic Control

Some studies tackle the non-rigid terrain locomotion problem by developing new controllers or efficient contact dynamics models. [22] shows a quadruped balancing on balls using model predictive control in a simulated environment. [23] adopts a momentum-based controller to balance on non-rigid terrain using a relatively simple four-link planar robot. [24] introduces a controller tuned differently depending on whether the ground is rigid or non-rigid. [25] relaxes the hard constraints of contact dynamics so that their nonlinear model predictive control can be solved efficiently subject to contact, allowing non-rigid terrain locomotion. [26] introduces an online method that feeds terrain knowledge to a whole-body controller for contact-consistent solutions. [27] focuses on capturing contact properties by developing a contact model that can be applied to direct trajectory planning. Our method tackles a similar environment in which the terrain moves elastically up and down when the quadruped steps on it, and we propose another direction for non-rigid terrain locomotion.

III Terrain and Robot Models


Figure 2: Our model for the flat elastic terrain. Each tile is connected to the floor by a spring-loaded prismatic joint.

We construct a simulation environment using the PyBullet physics engine [28] and employ a Laikago robot model [29] as our target robot.

Construction of Non-Rigid Terrain

The non-rigid terrain model that we design is a flat elastic terrain consisting of a matrix of tiles that move up and down passively when the robot's foot pushes on them, as shown in Fig. 2. Each tile, 20 cm in both width and length, is connected to the flat rigid base via a prismatic joint with a spring. The stiffness of the spring is adjusted to control the amount of sinking. For instance, a terrain with a 5 cm sinking depth is created by setting the spring stiffness so that the average sinking depth of the four tiles pressed by the legs reaches 5 cm when the robot stands still.
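To make the terrain construction concrete, the following is a minimal PyBullet sketch of one such tile, assuming the spring is emulated by applying a spring-damper force along the prismatic joint at every simulation step; the tile thickness, damping value, and function names are illustrative assumptions, not values from the paper.

```python
import pybullet as p

TILE_SIZE = 0.20       # 20 cm square tiles
TILE_THICKNESS = 0.02  # assumed tile thickness

def create_tile(x, y, stiffness, damping=5.0):
    """Create one tile connected to a fixed base by a vertical prismatic joint."""
    half = [TILE_SIZE / 2, TILE_SIZE / 2, TILE_THICKNESS / 2]
    col = p.createCollisionShape(p.GEOM_BOX, halfExtents=half)
    vis = p.createVisualShape(p.GEOM_BOX, halfExtents=half)
    tile = p.createMultiBody(
        baseMass=0,                       # base link is anchored to the world
        basePosition=[x, y, 0],
        linkMasses=[1.0],
        linkCollisionShapeIndices=[col],
        linkVisualShapeIndices=[vis],
        linkPositions=[[0, 0, TILE_THICKNESS]],
        linkOrientations=[[0, 0, 0, 1]],
        linkInertialFramePositions=[[0, 0, 0]],
        linkInertialFrameOrientations=[[0, 0, 0, 1]],
        linkParentIndices=[0],
        linkJointTypes=[p.JOINT_PRISMATIC],
        linkJointAxis=[[0, 0, 1]])        # the tile slides up and down
    # Disable the default velocity motor so the joint can move freely.
    p.setJointMotorControl2(tile, 0, p.VELOCITY_CONTROL, force=0)
    return tile, stiffness, damping

def apply_spring_forces(tiles):
    """Call once per simulation step: apply a spring-damper force to each tile."""
    for tile, k, c in tiles:
        q, qdot = p.getJointState(tile, 0)[:2]
        p.setJointMotorControl2(tile, 0, p.TORQUE_CONTROL, force=-k * q - c * qdot)
```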


Figure 3: This figure shows the joint structure of the robot model. Yellow curves denote the trajectories of the base and the foot determined by the learned policy.

Robot Model

The Laikago robot model weighs about 25 kg and has 12 degrees of freedom (DoFs), with a 2-DoF shoulder and a 1-DoF knee joint in each leg. Figure 3 shows the joint structure of the robot. The length of the base is 55 cm and that of each leg is about 50 cm.

IV Training

We now describe the design of our reinforcement learning framework for the quadruped on non-rigid terrains. A single locomotion cycle consists of four phases: in each phase (0.75 sec), one foot takes off and lands while the other feet maintain contact with the ground. One action signal output by the control policy defines the movement of the robot for one entire phase; thus, four consecutive action signals constitute one locomotion cycle.

IV-A Action

The target trajectories of the base and the swing foot in a phase are modeled with cubic Bezier curves, and our action defines the control points of these curves. One advantage of the Bezier curve is that the resulting trajectories are smooth. The cubic Bezier curve is chosen because it is the lowest order that allows sufficient diversity in the curve configuration. A total of three 3D cubic Bezier curves are used: for the base position, the base orientation, and the swing foot. In each Bezier curve, the first control point is automatically set to the current value of the trajectory, and the remaining three control points are defined by the action. The coordinates of the control points are expressed with respect to the base frame. As a result, one action is a 27D vector.
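As an illustration, the following sketch (with hypothetical function names) shows how a 27D action could be unpacked into the free control points of the three cubic Bezier curves and how a curve is evaluated; it is a minimal interpretation of the description above (ignoring the offset bounds and frame conversions), not the authors' code.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def action_to_curves(action, base_pos, base_rpy, swing_foot_pos):
    """Split the 27D action into the three free control points of each curve.
    The first control point of each curve is the current value of that quantity."""
    a = np.asarray(action).reshape(3, 3, 3)   # (curve, control point, xyz)
    return {
        "base_position":    np.vstack([base_pos,       a[0]]),
        "base_orientation": np.vstack([base_rpy,       a[1]]),
        "swing_foot":       np.vstack([swing_foot_pos, a[2]]),
    }  # each entry: a 4x3 array of control points
```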

Confining Action Space

We found that providing a confined exploration range in the action space is important for successful convergence of the learned policy. In our case, the bounds describe the possible range of the Bezier curve control points. Specifically, for the base position, the second and third control points are confined to [-6, 6] cm, and the last control point is limited to [-4, 8] cm in each coordinate axis from the current position, i.e., the coordinates of the first control point. For the base orientation, all three control points are bounded to [-0.3, 0.3] rad in each Euler angle coordinate from the current orientation. Confining the action space for the foot trajectory is a bit more involved. The heights of the control points are bounded to [-15, 15] cm from the current height. The frontal and lateral coordinates, on the other hand, are bounded with respect to their default positions in the robot's initial squatting pose. The lateral coordinate is bounded to [-15, 15] cm from the default position. For the frontal coordinate, the front feet are bounded to [-15, 15] cm while the hind feet are bounded to [$\delta$-15, $\delta$+15] cm from the default position, where the shift $\delta$ is set to -2 cm.
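Below is one hedged way to encode these bounds as a 27D box, assuming the action entries are interpreted as offsets from the current base pose and from the swing foot's default position, and assuming a (frontal, lateral, height) coordinate ordering; the exact encoding is not specified in the paper, so the grouping and ordering here are assumptions.

```python
import numpy as np
from gym import spaces

CM = 0.01
# Base position curve: points 2-3 within [-6, 6] cm, point 4 within [-4, 8] cm.
base_pos_low  = np.array([-6, -6, -6, -6, -6, -6, -4, -4, -4]) * CM
base_pos_high = np.array([ 6,  6,  6,  6,  6,  6,  8,  8,  8]) * CM
# Base orientation curve: all three points within [-0.3, 0.3] rad per Euler angle.
base_orn_low, base_orn_high = np.full(9, -0.3), np.full(9, 0.3)
# Swing-foot curve, per control point: (frontal, lateral, height) offsets.
delta = -2 * CM                                  # frontal shift for the hind feet
front_low,  front_high  = np.tile([-15 * CM, -15 * CM, -15 * CM], 3), np.tile([15 * CM, 15 * CM, 15 * CM], 3)
hind_low,   hind_high   = np.tile([delta - 15 * CM, -15 * CM, -15 * CM], 3), np.tile([delta + 15 * CM, 15 * CM, 15 * CM], 3)

def action_space(is_hind_foot):
    """27D action box; the foot bounds depend on whether a front or hind foot swings."""
    foot_low, foot_high = (hind_low, hind_high) if is_hind_foot else (front_low, front_high)
    low  = np.concatenate([base_pos_low,  base_orn_low,  foot_low]).astype(np.float32)
    high = np.concatenate([base_pos_high, base_orn_high, foot_high]).astype(np.float32)
    return spaces.Box(low=low, high=high)
```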

Motion Generation

As a control signal, the desired angle for each joint is generated approximately every 4 milliseconds. For this, the generated Bezier curves are divided into 180 points, and inverse kinematics is solved to obtain the desired joint angles that achieve the target configurations of the base and feet specified by the Bezier curve points. The computed joint angles are fed to the joint position controller in PyBullet, which generates joint torques to track the desired joint angles.
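The following is a simplified sketch of this tracking loop using PyBullet's inverse kinematics and position controller, reusing `cubic_bezier` from the sketch above. It only tracks the swing-foot curve (the actual controller also tracks the base position and orientation curves through the stance legs), and the base-frame-to-world conversion and joint-index bookkeeping are assumptions.

```python
import pybullet as p

NUM_POINTS = 180                 # ~0.75 s / 180 ≈ 4.17 ms between control targets

def execute_phase(robot, curves, swing_foot_link, actuated_joints):
    """Track one phase: sample the Bezier curve, solve IK, command the joints."""
    for i in range(NUM_POINTS):
        t = (i + 1) / NUM_POINTS
        # Assumed to already be a world-frame target for the swing foot.
        foot_target = cubic_bezier(*curves["swing_foot"], t)
        q_des = p.calculateInverseKinematics(robot, swing_foot_link, foot_target)
        p.setJointMotorControlArray(robot, actuated_joints,
                                    p.POSITION_CONTROL,
                                    targetPositions=q_des[:len(actuated_joints)])
        p.stepSimulation()       # one simulation step per control point, for simplicity
```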

IV-B Observation

Our method generates action signals at relatively long time intervals of 0.75 seconds, and the amount of information that the robot can collect during this decision interval is very large; for example, the robot can track its joint angles every few milliseconds. We therefore select only partial information as the observation to keep the size of the neural networks at a reasonable level. The attributes in our observation space consist only of quantities measurable by the real robot. They can be categorized into three types: 1) values that are directly sensed by the robot, 2) values that are derived from the sensed values, and 3) values related to the designed goals. Our observation consists of the following attributes, which add up to a 102D vector (a sketch of assembling this vector follows the list).

  • The height of the base $h_{b}$, represented as the heights of the four base joint positions above the terrain directly below each joint (4D).

  • The orientation of the base represented by the direction of the gravity vector with respect to the base frame (3D).

  • The linear and angular velocities of the base (6D).

  • The pitch angle of the base (1D).

  • The history of the four end-effector positions in its base frame at the start of three previous phases and at the current time step (48D).

  • The history of the four end-effector positions in its base frame at 4 and 8 milliseconds before the current time step (24D).

  • The velocity of the four end-effectors (12D).

  • The direction of the goal (azimuth angle) $\phi_{g}$ from the base frame (1D).

  • The position of the goal $\rho_{g}$ in the base frame; only the frontal and lateral coordinates are used (2D).

  • Current phase (1D).
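For concreteness, here is a sketch of how these terms could be concatenated into the 102D observation vector; the container and field names are placeholders introduced for illustration, not part of the paper.

```python
import numpy as np

def build_observation(robot_state, goal, phase):
    """Concatenate the observation terms listed above into one 102D vector."""
    obs = np.concatenate([
        robot_state.base_heights,        # (4,)  heights of the four base joints above the tiles
        robot_state.gravity_in_base,     # (3,)  gravity direction in the base frame
        robot_state.base_lin_vel,        # (3,)  base linear velocity
        robot_state.base_ang_vel,        # (3,)  base angular velocity
        [robot_state.base_pitch],        # (1,)
        robot_state.ee_history_phases,   # (48,) 4 feet x 3 coords x 4 phase starts
        robot_state.ee_history_recent,   # (24,) 4 feet x 3 coords x {t-4 ms, t-8 ms}
        robot_state.ee_velocities,       # (12,) 4 feet x 3 coords
        [goal.azimuth],                  # (1,)  direction of the goal
        goal.position_xy,                # (2,)  frontal/lateral goal position in base frame
        [phase],                         # (1,)  current phase index
    ])
    assert obs.shape == (102,)
    return obs
```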

IV-C Reward Function

Our goal is to make the robot walk on non-rigid terrain without falling, using only a small number of reward terms. A total of five reward terms are used.

IV-C1 Goal distance reward

It checks whether the robot is moving towards the goal.

$R_{d}=\alpha_{d}\left(\|\rho_{g}\|-\|\rho_{g,p}\|\right)$   (1)

where $\rho_{g,p}$ is the target position with respect to the robot's base frame at the start of the previous phase, and the scaling factor $\alpha_{d}=10$ if $\|\rho_{g}\|<\|\rho_{g,p}\|$, and $\alpha_{d}=1$ otherwise.

IV-C2 Goal orientation reward

It checks whether the agent is heading in the right direction by giving a positive reward if the azimuth angle of the goal from the base is less than 10 degrees.

$R_{o}=\max\left(0,\,0.02\times(10-|\phi_{g}|)\right)$   (2)

IV-C3 Minimum height reward

It encourages the robot's base to stay above a minimum height from the terrain. If the height $h_{b}$ of the base is larger than 25 cm, the robot receives $R_{s}=0.1$.

IV-C4 Torque minimizing reward

It encourages the robot to use less torque while achieving the goal.

$R_{t}=\max\left(0,\,0.004\times\left(\tau_{thresh}-\tau_{ave}\right)\right)$,   (3)

where $\tau_{ave}$ is the average magnitude of the joint torque vector during a phase and $\tau_{thresh}=140$ in our experiments.

IV-C5 Roll angle reward

It encourages the robot to stabilize its roll angle.

$R_{r}=\max\left(0,\,2\times(0.1-|\varphi|)\right)$   (4)

where $\varphi$ is the roll angle in radians.

The total reward is thus $R=R_{d}+R_{o}+R_{s}+R_{t}+R_{r}$.
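The five terms can be transcribed almost directly into code. The sketch below follows Eqs. (1)-(4) and the minimum height rule; the function and argument names are placeholders, and the single base-height value passed in is assumed to be the quantity compared against the 25 cm threshold.

```python
import numpy as np

def compute_reward(rho_g, rho_g_prev, phi_g_deg, base_height, tau_ave, roll):
    """Evaluate the five reward terms described above and return their sum."""
    # Goal distance reward (Eq. 1): higher weight when the robot got closer to the goal.
    alpha_d = 10.0 if np.linalg.norm(rho_g) < np.linalg.norm(rho_g_prev) else 1.0
    r_d = alpha_d * (np.linalg.norm(rho_g) - np.linalg.norm(rho_g_prev))
    # Goal orientation reward (Eq. 2): positive when the goal azimuth is within 10 degrees.
    r_o = max(0.0, 0.02 * (10.0 - abs(phi_g_deg)))
    # Minimum height reward: base higher than 25 cm.
    r_s = 0.1 if base_height > 0.25 else 0.0
    # Torque minimizing reward (Eq. 3): tau_ave is the mean joint-torque magnitude over the phase.
    r_t = max(0.0, 0.004 * (140.0 - tau_ave))
    # Roll angle reward (Eq. 4): roll in radians.
    r_r = max(0.0, 2.0 * (0.1 - abs(roll)))
    return r_d + r_o + r_s + r_t + r_r
```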

IV-D Termination Condition

We employ early termination to avoid falling into local minima and to enhance sample efficiency, as suggested by [30], [31], and [32]. In our framework, if one of the following conditions is met, the current episode is terminated, the agent receives a reward of -10, and training restarts from a new initial state (a sketch of this check follows the list). The policy is thus trained to avoid the early termination conditions.

  • The base height decreases below 20 cm, which indicates that the robot's configuration is near the joint limits.

  • The pitch angle of the base exceeds the allowed range (-15 to 15 degrees), which indicates that the robot is inclined too much.

  • Any link other than the end-effectors collides with the ground.
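A minimal sketch of this check is given below; `contact_links`, the set of robot links currently in contact with the terrain, is an assumed input rather than something specified in the paper.

```python
import math

MIN_BASE_HEIGHT = 0.20            # meters
MAX_PITCH = math.radians(15.0)    # radians

def should_terminate(base_height, base_pitch, contact_links, end_effector_links):
    """Early-termination check corresponding to the three conditions above."""
    if base_height < MIN_BASE_HEIGHT:
        return True                # base too low: configuration near the joint limits
    if abs(base_pitch) > MAX_PITCH:
        return True                # robot inclined too much
    # Any contact with the ground by a link that is not an end-effector.
    if any(link not in end_effector_links for link in contact_links):
        return True
    return False
```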

IV-E Training and Implementation Details

We first initialize our robot to a stable squat pose with a low center of mass height, as shown in Fig. 3. The robot has a fixed phase transition order: it moves the legs in the order of front-left, rear-right, front-right, and rear-left.

In the training stage, we use four types of non-rigid terrain with sinking depths of 2, 3, 4, and 5 cm, as well as a rigid terrain. Training always starts on terrain with a 2 cm sinking depth, after which the terrain type changes randomly every N (2 or 8) meters. The initial position of the robot is randomized as well. We train the robot by placing a target 2 meters ahead; once the robot reaches the target, the next goal is placed another 2 meters ahead, and so on.
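As an illustration of this schedule, the sketch below assigns a sinking depth to each terrain segment and advances the goal; the segment representation, depth values for the rigid case, and function names are assumptions made for the example.

```python
import numpy as np

SINK_DEPTHS = [0.0, 0.02, 0.03, 0.04, 0.05]   # rigid plus 2-5 cm sinking depth

def terrain_schedule(total_length_m, segment_m=2.0, rng=np.random):
    """Assign a sinking depth to each terrain segment along the walking direction.
    Training always starts on the 2 cm terrain; later segments are drawn at random."""
    n_segments = int(np.ceil(total_length_m / segment_m))
    return [0.02] + [rng.choice(SINK_DEPTHS) for _ in range(n_segments - 1)]

def next_goal(current_goal_xy, heading_xy):
    """Once the robot reaches its goal, place the next goal 2 m further ahead."""
    heading_xy = np.asarray(heading_xy, dtype=float)
    return np.asarray(current_goal_xy) + 2.0 * heading_xy / np.linalg.norm(heading_xy)
```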

We use OpenAI Gym [33] to create the learning environment and adopt the PPO algorithm [34] provided by Stable Baselines [35]. The policy and value networks have an identical structure of two hidden layers with 256 and 128 units, respectively, with tanh as the activation function. The discount factor is 0.95 and the policy learning rate is $2\times 10^{-4}$. The minibatch size is 4096 and the number of PPO epochs is 10.
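A hedged sketch of this setup with Stable Baselines is given below. The environment factory and the total number of timesteps are placeholders, and note that Stable Baselines' PPO2 takes a minibatch count (`nminibatches`) rather than a minibatch size, so the mapping from the stated minibatch size of 4096 depends on the rollout length.

```python
import tensorflow as tf
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

def train(make_env, total_timesteps=int(1e7)):
    """Train the locomotion policy with PPO; `make_env` builds the gym.Env described
    in this paper (not included here), and `total_timesteps` is an arbitrary budget."""
    env = DummyVecEnv([make_env])
    policy_kwargs = dict(
        act_fun=tf.nn.tanh,                              # tanh activations
        net_arch=[dict(pi=[256, 128], vf=[256, 128])],   # identical policy/value networks
    )
    model = PPO2(MlpPolicy, env,
                 gamma=0.95,           # discount factor
                 learning_rate=2e-4,   # policy learning rate
                 noptepochs=10,        # PPO epochs per update
                 nminibatches=4,       # minibatch *count*; see the note above
                 policy_kwargs=policy_kwargs,
                 verbose=1)
    model.learn(total_timesteps=total_timesteps)
    return model
```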

V Results


Figure 4: From (a) to (d), walking on non-rigid terrain with a 5 cm sinking depth.


Figure 5: Walking on non-rigid terrain where the sinking depth changes every two meters. (a) Transition from 3 cm to 2 cm sinking depth. (b) Transition from 5 cm to rigid ground. (c) Walking on 4 cm sinking depth terrain.

In this section, we examine the characteristics and the effectiveness of our method. For a visual overview, Fig. 4 shows the quadruped walking on a terrain with a 5 cm sinking depth, and Fig. 5 shows it walking on terrain with varying stiffness. We first examine the characteristics of our framework by analyzing the trajectories of the base height and the target landing height of each foot on different terrains. Second, we identify crucial components of our method. Third, we discuss the effect of other observation terms that improve the stability of locomotion. Lastly, we share our experiments on increasing the observation space.

V-A Training on Different Terrains

We examine how our method results in different policies depending on the terrain conditions for training. Our training environments include four scenarios.

  • $T_{v2}$ (Baseline): Terrain whose stiffness changes every 2 meters.

  • $T_{v8}$: Terrain whose stiffness changes every 8 meters.

  • $T_{c}^{2}$: Terrain of constant stiffness with a 2 cm sinking depth.

  • $T_{c}^{5}$: Terrain of constant stiffness with a 5 cm sinking depth.

For $T_{c}^{5}$, training succeeds with curriculum learning: first training on $T_{v2}$, then on $T_{c}^{5}$. For a side-by-side comparison, we apply the same curriculum learning to $T_{c}^{2}$.

Figure 6: Trajectories of the base height $h_{b}$ (cm) on different terrains. The x-axis denotes the phase angle, from the front-left swing [0, 0.5$\pi$] to the rear-left swing [1.5$\pi$, 2$\pi$].
TABLE I: Mean and standard deviation of the base height $h_{b}$, and the target landing height of each foot (cm).

          T_v2^ave  T_v2^2  T_v2^5  T_v8^2  T_v8^5  T_c^2  T_c^5
σ(h_b)       3.1      3.2     3.2     3.2     3.1     3.3    2.9
μ(h_b)      33.1     33.7    32.8    33.0    32.6    35.8   33.5
μ(fr)       13.8     13.0    12.6     6.3     6.8     6.9   14.4
μ(fl)        5.4      5.9     4.5    -0.7     0.2     1.1    6.1
μ(rr)       -1.7     -2.3    -2.9    -7.7    -5.6    -6.3   -1.8
μ(rl)       -9.7    -10.5   -10.1   -14.8   -11.8   -14.8   -8.9

V-A1 Base height

Figure 6 shows the trajectories of the base height on different terrains. They show a similar pattern across terrains: the base descends then ascends during the front-leg swing phases ([0, 0.5$\pi$] and [$\pi$, 1.5$\pi$]) and descends during the rear-leg swing phases. In all scenarios, we observe smooth movement of the base position. Table I shows the means ($\mu$) and standard deviations ($\sigma$) of the base height under various terrain conditions: $T_{v2}^{ave}$ is the value averaged over the various terrains in $T_{v2}$, and $T_{v2}^{2}$ is the value on tiles with a 2 cm sinking depth in $T_{v2}$. The table shows that the base height has similar means and standard deviations in all scenarios, suggesting that our method produces stable results over terrain variations.

V-A2 Target landing height of each foot

Table I also shows the target landing height of each foot ($fr$: front-right, $fl$: front-left, $rr$: rear-right, $rl$: rear-left) on different terrains. The height is measured relative to the height of each foot in the default squat pose (Fig. 3).

In all scenarios, the robot learns to take different actions between the left and right legs. Our framework has a fixed phase order starting from the left side, which seems to make the robot rely on the left side more than the right side.

First, $T_{c}^{2}$ and $T_{c}^{5}$ show different landing heights for all feet. This suggests that our framework learns a different policy depending on the terrain stiffness if the stiffness is constant. Would it then learn to take different actions on terrain with varying stiffness? The target landing heights for $T_{v2}^{2}$ and $T_{v2}^{5}$ are similar for all feet, so our framework does not learn to vary the action with the terrain stiffness if the training environment changes the stiffness every 2 meters. In this scenario, $T_{v2}^{ave}$ is quite similar to $T_{c}^{5}$, which suggests that our framework learns a conservative policy, acting as if all terrains had the maximum sinking depth, when the terrain stiffness changes every 2 meters.

In contrast, $T_{v8}^{2}$ and $T_{v8}^{5}$ show different actions, especially for the rear feet, showing that our framework learns to take different actions when trained on terrain whose stiffness changes every 8 meters. The terrains $T_{c}^{2}$ and $T_{v8}^{2}$ show very similar results: the robot takes the same action for $T_{v8}^{2}$ as for $T_{c}^{2}$. However, $T_{v8}^{5}$ differs from $T_{c}^{5}$: the target landing heights of the rear feet for $T_{v8}^{5}$ lie between those of $T_{c}^{2}$ and $T_{c}^{5}$. This suggests that the framework does not reach the optimal policy for the $T_{v8}^{5}$ terrain. Nonetheless, it is noteworthy that it learns to diversify its actions on the $T_{v8}^{5}$ terrain. The standard deviation is less than 1 cm for $T_{v8}^{2}$, but around 2 cm and 3 cm for the front and rear feet, respectively, on $T_{v8}^{5}$, which means that the robot takes much more diverse actions to step on tiles of different heights on the $T_{v8}^{5}$ terrain.

V-B Crucial Components for Non-Rigid Terrain Locomotion

We discuss key components that make locomotion on non-rigid terrain successful.

V-B1 End-effector position history

The first crucial component is the history of end-effector positions in the observation. It provides a memory of the robot state at four time steps (spanning 3 seconds), taken at the start of each phase, which seems to help the robot cope with the bounciness of the terrain. Learning failed when even one fewer phase was remembered. We conjecture that the end-effector position history also helps the robot adapt its locomotion pattern to the terrain condition; different terrain conditions lead to different end-effector position patterns, which the robot can use to adopt different action strategies.

V-B2 End-effector velocity

The second crucial component of our framework is the end-effector velocity in the observation. Without this term, the robot fails to learn to walk; instead, it learns only to keep its balance without moving forward.

V-B3 Restricted action range

Another important component of our framework is confining the action space. If the allowed range of the action space is too large, the policy fails to learn. When it is too small, the policy either 1) fails because it cannot find a policy that overcomes the thresholds formed by tiles of different heights, or 2) falls into inefficient movements, such as advancing in overly small increments.

V-C Other Observation Terms

V-C1 End-effector positions at previous 4 and 8 milliseconds

The history of the end-effector positions over one cycle is not sufficient to tell the robot whether its foot is currently stuck at a threshold. This additional information helps the robot learn to raise its feet so as to avoid foot traps and standing still.

V-C2 The base orientation terms

The gravity direction vector and the pitch angle terms help the robot stabilize the base movement, reducing unnatural movements such as pointing toward the sky or the ground.

V-D Adding More Observations

We conduct several experiments to see whether including more observations yields a higher reward or reduces the training time. First, including a longer history of end-effector velocities leads to excessive movement of the base. Second, including a twice-denser history of end-effector positions within one cycle only slows down learning significantly. Lastly, adding joint position and velocity information does not bring a noticeable change to the result; since our proposed observation already includes information about the robot's pose, the extra joint position and velocity information does not seem to provide additional benefit.

VI Limitations and Future Work

Our work has a number of limitations that need to be overcome in future research. One major limitation is that our method does not react agilely to the ground, because the motion is planned only once per phase while the terrain moves dynamically when pushed by the feet. This decreases the robot's responsiveness to changes in terrain stiffness and to cases in which a foot is caught on a threshold. One way to increase the responsiveness would be to augment the framework with an additional lower-level controller that learns to promptly modify the planned motion trajectories according to the terrain conditions. Another, more straightforward, way would be to design a single-level DRL framework in which the learned policy outputs control commands at each control time step.

In this work, we only tested with flat elastic terrains. Interesting future work is to explore other types of non-rigid terrains, such as sloped terrains and plastically deforming terrains, which are frequently found in the real world.

References

  • [1] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, 2019. [Online]. Available: https://robotics.sciencemag.org/content/4/26/eaau5872
  • [2] J. Lee, J. Hwangbo, and M. Hutter, “Robust recovery controller for a quadrupedal robot using deep reinforcement learning,” 2019.
  • [3] F. Shi, T. Homberger, J. Lee, T. Miki, M. Zhao, F. Farshidian, K. Okada, M. Inaba, and M. Hutter, “Circus anymal: A quadruped learning dexterous manipulation with its limbs,” 2020.
  • [4] V. Tsounis, M. Alge, J. Lee, F. Farshidian, and M. Hutter, “Deepgait: Planning and control of quadrupedal gaits using deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3699–3706, 2020.
  • [5] S. Gangapurwala, A. Mitchell, and I. Havoutis, “Guided constrained policy optimization for dynamic quadrupedal robot locomotion,” IEEE Robotics and Automation Letters, vol. PP, pp. 1–1, 03 2020.
  • [6] G. Bellegarda and K. Byl, “Training in task space to speed up and guide reinforcement learning,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 2693–2699.
  • [7] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver, “Emergence of locomotion behaviours in rich environments,” 2017.
  • [8] X. B. Peng, G. Berseth, and M. van de Panne, “Terrain-adaptive locomotion skills using deep reinforcement learning,” ACM Trans. Graph., vol. 35, no. 4, July 2016. [Online]. Available: https://doi.org/10.1145/2897824.2925881
  • [9] X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne, “Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning,” ACM Trans. Graph., vol. 36, no. 4, July 2017. [Online]. Available: https://doi.org/10.1145/3072959.3073602
  • [10] Z. Xie, G. Berseth, P. Clary, J. Hurst, and M. van de Panne, “Feedback control for cassie with deep reinforcement learning,” 2018.
  • [11] W. Yu, G. Turk, and C. K. Liu, “Learning symmetric and low-energy locomotion,” ACM Transactions on Graphics, vol. 37, no. 4, p. 1–12, Aug 2018. [Online]. Available: http://dx.doi.org/10.1145/3197517.3201397
  • [12] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” CoRR, vol. abs/1804.10332, 2018. [Online]. Available: http://arxiv.org/abs/1804.10332
  • [13] X. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” 05 2018, pp. 1–8.
  • [14] T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, and S. Levine, “Learning to walk via deep reinforcement learning.” in Robotics: Science and Systems, 2019. [Online]. Available: https://doi.org/10.15607/RSS.2019.XV.011
  • [15] S. Ha, P. Xu, Z. Tan, S. Levine, and J. Tan, “Learning to walk in the real world with minimal human effort,” 2020.
  • [16] Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, and V. Sindhwani, “Data efficient reinforcement learning for legged robots,” 2019.
  • [17] X. B. Peng, E. Coumans, T. Zhang, T.-W. Lee, J. Tan, and S. Levine, “Learning agile robotic locomotion skills by imitating animals,” 2020.
  • [18] Z. Xie, H. Y. Ling, N. H. Kim, and M. van de Panne, “Allsteps: Curriculum-driven learning of stepping stone skills,” 2020.
  • [19] Y.-S. Luo, J. H. Soeseno, T. P.-C. Chen, and W.-C. Chen, “Carl: Controllable agent with reinforcement learning for quadruped locomotion,” ACM Trans. Graph., vol. 39, no. 4, July 2020. [Online]. Available: https://doi.org/10.1145/3386569.3392433
  • [20] M. Rahme, I. Abraham, M. L. Elwin, and T. D. Murphey, “Dynamics and domain randomized gait modulation with bezier curves for sim-to-real legged locomotion,” 2020.
  • [21] J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter, “Learning quadrupedal locomotion over challenging terrain,” Science Robotics, vol. 5, no. 47, 2020.
  • [22] C. Yang, B. Zhang, J. Zeng, A. Agrawal, and K. Sreenath, “Dynamic legged manipulation of a ball through multi-contact optimization,” 2020.
  • [23] M. Azad and M. N. Mistry, “Balance control strategy for legged robots with compliant contacts,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 4391–4396.
  • [24] W. Bosworth, J. Whitney, S. Kim, and N. Hogan, “Robot locomotion on hard and soft ground: Measuring stability and ground properties in-situ,” 05 2016, pp. 3582–3589.
  • [25] M. Neunert, M. Stauble, M. Giftthaler, C. D. Bellicoso, J. Carius, C. Gehring, M. Hutter, and J. Buchli, “Whole-body nonlinear model predictive control through contacts for quadrupeds,” IEEE Robotics and Automation Letters, vol. 3, no. 3, p. 1458–1465, Jul 2018. [Online]. Available: http://dx.doi.org/10.1109/LRA.2018.2800124
  • [26] S. Fahmi, M. Focchi, A. Radulescu, G. Fink, V. Barasuol, and C. Semini, “Stance: Locomotion adaptation over soft terrain,” IEEE Transactions on Robotics, vol. PP, pp. 1–15, 01 2020.
  • [27] I. Chatzinikolaidis, Y. You, and Z. Li, “Contact-implicit trajectory optimization using an analytically solvable contact model for locomotion on variable ground,” IEEE Robotics and Automation Letters, vol. 5, no. 4, p. 6357–6364, Oct 2020. [Online]. Available: http://dx.doi.org/10.1109/LRA.2020.3010754
  • [28] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2019.
  • [29] Z. Chen, “Laikago,” 2018. [Online]. Available: https://www.unitree.com/products/laikago
  • [30] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Trans. Graph., vol. 37, no. 4, July 2018. [Online]. Available: https://doi.org/10.1145/3197517.3201311
  • [31] J. Won and J. Lee, “Learning body shape variation in physics-based characters,” ACM Trans. Graph., vol. 38, no. 6, Nov. 2019. [Online]. Available: https://doi.org/10.1145/3355089.3356499
  • [32] J. Won, D. Gopinath, and J. Hodgins, “A scalable approach to control diverse behaviors for physically simulated characters,” ACM Trans. Graph., vol. 39, no. 4, July 2020. [Online]. Available: https://doi.org/10.1145/3386569.3392381
  • [33] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.
  • [34] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017.
  • [35] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, “Stable baselines,” https://github.com/hill-a/stable-baselines, 2018.