
Teaser figure: Push-recovery experiments. The resilience and robustness of biped locomotion can be qualitatively characterized and quantitatively assessed from the analysis of post-push foot placements. Our experiments show that DRL policies are as robust as human walking.

Understanding the Stability of Deep Control Policies for Biped Locomotion

Hwangpil Park1, Ri Yu1, Yoonsang Lee2, Kyungho Lee3 and Jehee Lee1
1Department of Computer Science and Engineering, Seoul National University, South Korea 2Department of Computer Science, Hanyang University, South Korea 3NC Soft, South Korea
Corresponding Author, [email protected]
Abstract

Achieving stability and robustness is the primary goal of biped locomotion control. Recently, deep reinforcement learning (DRL) has attracted great attention as a general methodology for constructing biped control policies and has demonstrated significant improvements over the previous state-of-the-art. Although deep control policies have advantages over previous controller design approaches, many questions remain unanswered. Are deep control policies as robust as human walking? Does simulated walking use strategies similar to human walking to maintain balance? Does a particular gait pattern similarly affect human and simulated walking? What do deep policies learn to achieve improved gait stability? The goal of this study is to answer these questions by evaluating the push-recovery stability of deep policies in comparison with human subjects and a previous feedback controller. We also conducted experiments to evaluate the effectiveness of variants of DRL algorithms.

CCS Concepts: Computing methodologies → Reinforcement learning; Computing methodologies → Physical simulation

Keywords: Biped Locomotion, Physically Based Simulation, Push-Recovery Stability, Deep Reinforcement Learning, Gait Analysis

1 Introduction

The simulation and control of human locomotion is a central issue in physically-based animation. The design principles of biped controllers have pursued several fundamental goals: simulated locomotion should look human-like, should be resilient against unexpected disturbances, and should be interactively controllable to change its walking direction and speed. It is also desirable for the controller to be able to generate various gait patterns and transitions. We are particularly interested in the second goal and in understanding the balance-recovery capability of biped controllers.

Many design approaches have been explored to construct robust biped controllers: feedback control laws [HWBO95, YLvdP07], data-driven control (in the sense that the controller mimics a reference motion) [SKL07, LKL10, LvdPY16], nonlinear/stochastic optimization [dLMH10, WFH10, KH17], model-based optimization [CBvdP10, MdLH10, TLC09], and reinforcement learning [PALvdP18, YTL18]. In particular, recent advances in deep reinforcement learning (DRL) have made significant improvements in the simulation and control of biped locomotion. There are many variants of DRL algorithms for learning locomotion control. A typical example-guided algorithm takes a short motion clip as input and learns a control policy (a.k.a. controller) that allows the biped to imitate the dynamic behaviors captured in the motion data [PALvdP18]. Alternatively, DRL algorithms can produce control policies conditioned by continuous, user-controllable gait parameters, which may include walking speeds, steering angles, body shape variations (e.g., leg/arm lengths), and a large repertoire of action choices captured in unorganized motion data sets [WL19, BCHF19, PRL19].

Although deep control policies are substantially more robust than the previous state-of-the-art, many questions remain unanswered. Are deep control policies as robust as the balance-recovering capability of human locomotion? Does simulated locomotion use strategies similar to human locomotion to maintain balance? Does a particular gait pattern similarly affect human and simulated locomotion? Are conditioned control policies as robust as unconditioned control policies without adjustable parameters? What do deep policies learn to achieve robustness in locomotion?

In this paper, we evaluate the push-recovery stability of deep policies under various conditions (e.g., walking speed, stride length, the level of crouch, push timing, and push force). To do so, we conducted simulation-based stability tests with each control policy we trained and compared its stability with previous human/simulation experiments. The push-recovery stability measures how well the simulated biped withstands impulsive pushes. More specifically, two measures are adopted in our experiments: maximum detour distance and fall-over rate. The detour distance measures how far the biped detours in the direction of modest pushes to assess the resilience of human walking without falling over. The fall-over rate is more popular for assessing the stability of simulated controllers because the experiments can be conducted safely in the simulation environment with a wider range of push magnitudes. The characteristics of control policies are analyzed qualitatively and quantitatively based on post-push foot placement patterns.

We also evaluate the effectiveness of DRL variants. Gait-conditioned policies have many advantages over unconditioned, gait-specific policies in terms of computational time and memory usage. It has been believed that these advantages come at the cost of sacrificing balance capability to some extent. In our experiments, we observed that gait-conditioned policies are not necessarily inferior to gait-specific policies in terms of push-recovery stability. It has also been found that adaptive sampling in the gait parameter domain results in more robust policies than naïve non-adaptive learning, and that learning with random pushes results in more robust policies than learning without them. Random disturbances in the learning process not only improve resilience but also allow DRL policies to better emulate human balance strategies in the foot placement analysis. Overall, we found that DRL policies are as robust as human walking.

2 Related work

2.1 Physics-based Simulation and Control

The design of biped controllers that produce realistic human walking has been a challenging subject in computer graphics. The key challenge is designing a balancing mechanism, which is usually implemented as a feedback loop that adjusts the controller output (joint torques or PD target poses) based on its input (body state and environment information). A variety of control approaches has been explored to generate responsive and realistic human locomotion with diverse feedback mechanisms. Finite state machines equipped with manually-designed, intuitive feedback rules were an effective approach in early biped controller design [HWBO95, YLvdP07]. To mitigate the complexity of full-body dynamics, simplified dynamics models such as inverted pendulums have been used in controller design [KH10, CBvdP10, TLC09, MdLH10, KH17].

Data-driven (a.k.a. example-guided) approaches have been frequently used to improve the naturalness of simulated animations by adopting motion capture data as a reference to track [SKL07, LKL10, LYvdPG12, LvdPY16]. The balance mechanisms for data-driven controllers can be manually-crafted [LKL10], learned from a collection of example motions using a regression method [SKL07], or derived from a linear feedback policy using stochastic optimization and/or linear regression [LYvdPG12, LvdPY16]. Model predictive control (MPC) is used for synthesizing the full-body character animations in a physically plausible manner using reference motion data [DSAP08, HHC19]. Trajectory optimization is also employed to fulfill given tasks [ABDLH12, PM18, LH18]. Many studies have utilized nonlinear optimization methods to improve the robustness of controllers, or to explore control schemes for given tasks  [YL10, SKL07, LYvdPG12, dLMH10, WFH10, WHDK12].

2.2 DRL for Locomotion Control

Recently, deep reinforcement learning has received significant attention and shown impressive improvements in the biped control problem. The control policy represented by a deep neural network can effectively achieve a feedback balancing mechanism in learning-based control. A variety of DRL algorithms has been proposed to learn control policies for biped locomotion [HTS17, SML15, YTL18]. Yu et al. [YTL18] proposed a mirror symmetry loss with curriculum learning to learn a locomotion policy without any reference data.

DRL algorithms can also take advantage of motion capture data. Liu et al. [LH17] used deep Q-learning to learn a scheduler that reorders short control fragments, each in charge of reproducing a short motion segment. Peng et al. [PBYVDP17] presented a two-level hierarchical DRL-based control framework learned from short reference motion clips. In a follow-up study, they presented a DRL method that learns a control policy imitating a given reference motion clip [PALvdP18]. Lee et al. [LPLL19] proposed a DRL-based controller for a muscle-actuated anatomical model. Learning a control policy that exploits a set of reference motion data can be facilitated by recurrent neural networks [PRL19] and a motion matching technique [BCHF19]. Although the adoption of DRL in biped control has been very successful, few in-depth analyses of the characteristics of DRL-based controllers have been presented.

2.3 Stability Analysis

Gait and postural stability has been measured quantitatively using a waist-pull system [RHJ01] and a movable platform [BWSC01], which can apply quantified perturbations to human subjects. In biomechanics, gait stability has been estimated using measures derived from nonlinear time-series analysis, such as Lyapunov exponents and Floquet multipliers [DCCS00]. The correlation of foot placement with balancing capability has been investigated in biomechanics and robotics [MMK12]. Wight et al. [WKW07] advocated the Foot Placement Estimator (FPE) as a measure of balance for bipedal robots. In computer graphics, measuring the response to unexpected external pushes is a common criterion for estimating the resilience and robustness of controllers [YLvdP07, MdLH10, KH17, LKL10, LvdPY16].

Lee et al. [LLK15] statistically analyzed the push-recovery stability of humans and of simulated bipeds controlled by a hand-crafted feedback controller [LKL10]. Using maximum detour distance as a stability measure, they identified key gait factors (walking speed, level of crouching, push magnitude and timing) that affected the stability of human walking. Their experimental setups are adopted in our simulation experiments, and their human experiments serve as a reference for the human-vs-simulation comparisons in our study.

This study begins with the question of how the characteristics of DRL-based controllers compare to those of human walking and previous biped controllers. We aim to gain a deeper understanding of, and insight into, how DRL achieves better stability in this notoriously challenging control problem.

3 Biped Locomotion Simulation

Figure 1: Our full-body dynamics model. Green balls and blue cylinders represent ball-and-socket (3 DoF) joints and revolute (1 DoF) joints, respectively.

Our full-body biped model has rigid bones connected by 8 revolute joints (elbows, knees, and toes) and 14 ball-and-socket joints. The model has 56 degrees of freedom (DoFs) in total, including an unactuated 6-DoF root joint. The biped is 1.7 meters tall and weighs 72 kg. The articulated skeleton is physically simulated and actuated by joint torques (see Figure 1).

In this study, we consider two versions of DRL-based algorithms for stability analysis: gait-specific and gait-conditioned. The gait-specific algorithm serves as a common component of recent biped simulation studies [PALvdP18, LPLL19]. It takes a short reference motion clip as input and learns a control policy using reinforcement learning. This example-guided control policy represented by a deep neural network provides a distribution of plausible actions at every state. A series of plausible actions sampled from the distributions would drive the biped to track the reference motion while maintaining its balance.

The gait-conditioned algorithm exploits a family of reference motion clips parameterized by user-controlled parameters, such as walking speed and stride length. The gait-conditioned control policy learns how to deal with variations in gaits and styles. Learning a gait-conditioned policy is more computation- and memory-efficient than learning a grid of gait-specific policies over the parametric domain, since each individual gait-specific policy would have to be learned from scratch.

3.1 Gait-specific Policies

The state of the biped $s=[p,\dot{p},\phi]$ includes the positions and velocities of the body links relative to the body coordinate system attached to the skeletal root. $\phi\in[0,1]$ is a normalized phase that matches the temporal span of the reference motion. The output of the control policy is a PD (Proportional Derivative) target that generates joint torques through PD control. The reward function is

$r=(w_{q}r_{q}+w_{v}r_{v})\,r_{e}+w_{g}r_{g},$   (1)

where $r_q$, $r_v$, and $r_e$ are the rewards for tracking the reference poses, joint velocities, and end-effector positions, respectively. The end-effector reward is multiplied with the pose/velocity tracking rewards, as suggested by Lee et al. [LPLL19], since the terms reinforce each other. $r_g$ encourages the biped to walk along a straight line and thus to come back to the line after an external push. We set $w_q=0.8$, $w_v=0.1$, and $w_g=0.1$.

\begin{split}
r_{q}&=\exp\left(-\frac{\lVert\hat{\mathbf{q}}-\mathbf{q}\rVert^{2}}{\sigma_{q}}\right)\\
r_{v}&=\exp\left(-\frac{\lVert\hat{\dot{\mathbf{q}}}-\dot{\mathbf{q}}\rVert^{2}}{\sigma_{v}}\right)\\
r_{e}&=\exp\left(-\frac{\lVert\hat{\mathbf{p}}_{e}-\mathbf{p}_{e}\rVert^{2}}{\sigma_{e}}\right)\\
r_{g}&=\exp\left(-\frac{\mathrm{dist}(\mathbf{d},\mathbf{p}_{c})^{2}}{\sigma_{g}}\right).
\end{split}   (2)

Here, $\mathbf{q}$ is an aggregated joint angle vector, $\dot{\mathbf{q}}$ is its time derivative, and $\mathbf{p}_e$ is an aggregated vector of end-effector positions. The hat symbol indicates values from the reference motion. $\mathrm{dist}(\mathbf{d},\mathbf{p}_c)$ is the distance from the center of mass (CoM) of the character $\mathbf{p}_c$ to the straight line $\mathbf{d}$.
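As a concrete illustration, the reward of Equations (1)-(2) can be computed as in the following Python sketch; the aggregated vectors are assumed to be NumPy arrays, and the scale constants σ are placeholder values not reported above.

import numpy as np

# Weights from the paper; the sigma scales below are illustrative placeholders.
W_Q, W_V, W_G = 0.8, 0.1, 0.1
SIGMA_Q, SIGMA_V, SIGMA_E, SIGMA_G = 2.0, 20.0, 0.5, 1.0

def tracking_reward(q, q_ref, dq, dq_ref, p_e, p_e_ref, line_dist):
    """Reward of Eq. (1): (w_q r_q + w_v r_v) * r_e + w_g r_g."""
    r_q = np.exp(-np.sum((q_ref - q) ** 2) / SIGMA_Q)      # pose tracking
    r_v = np.exp(-np.sum((dq_ref - dq) ** 2) / SIGMA_V)    # joint-velocity tracking
    r_e = np.exp(-np.sum((p_e_ref - p_e) ** 2) / SIGMA_E)  # end-effector tracking
    r_g = np.exp(-line_dist ** 2 / SIGMA_G)                # CoM distance to the line
    return (W_Q * r_q + W_V * r_v) * r_e + W_G * r_g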

We used Proximal Policy Optimization (PPO) [SWD17] and generalized advantage estimation [SML15] to learn a policy function that maximizes the expected cumulative reward. The learning process is episodic: experience tuples are collected stochastically by sampling actions from the policy at every time step of each episode, and the policy is updated with each batch of experience tuples. We refer the readers to the work of Peng et al. [PALvdP18] for implementation details.
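The following sketch illustrates the episodic data collection and advantage estimation described above; the env and policy objects are hypothetical stand-ins for the simulator and the network-based policy, the PPO update itself is left abstract, and the discount and λ values are common defaults rather than values reported in the paper.

import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation [SML15] over one episode; `values` holds
    one extra bootstrap entry for the state after the final step."""
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Episodic collection (hypothetical `env` / `policy` interfaces):
# tuples = []
# while len(tuples) < 8192:              # one batch of experience tuples (Section 5)
#     s = env.reset()
#     while not env.done():
#         a = policy.sample(s)           # stochastic action (PD target)
#         s_next, r = env.step(a)        # PD control -> joint torques -> simulation
#         tuples.append((s, a, r))
#         s = s_next
# policy.update_ppo(tuples)              # clipped-surrogate PPO update [SWD17]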

Even if the control policy is learned with a single reference motion clip, the exploration strategy of reinforcement learning examines unseen states around the reference trajectory, which provides a certain level of resilience to small unexpected disturbances. The control policy can be even more robust if it learns how to cope with disturbances and uncertainty during learning. It has been reported that randomly pushing the biped in the episodic simulations improves robustness [WFH10, PRL19]. In each episode, we applied a random force for 0.2 seconds to push the biped sideways from the left or the right. The push magnitudes and timings used in the learning process are described in Section 5.
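A minimal sketch of this random-push augmentation is given below; the force and timing distributions are illustrative stand-ins roughly consistent with the experimental statistics reported in Section 5, not the exact training values.

import numpy as np

PUSH_DURATION = 0.2  # seconds, as stated above

def sample_push_event(rng):
    """Sample one random push per episode. The magnitude and timing distributions
    below are illustrative stand-ins (forces of roughly 100-300 N, timing within
    the left-to-right heel-strike interval)."""
    force = rng.normal(200.0, 35.0)                            # Newtons
    timing = float(np.clip(rng.normal(0.34, 0.21), 0.0, 1.0))  # gait-phase fraction
    side = rng.choice([-1.0, 1.0])                             # push from left or right
    return side * force, timing

# Example: force, timing = sample_push_event(np.random.default_rng(0))
# The lateral force is applied for PUSH_DURATION seconds once the gait phase
# reaches `timing` in the episode.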

3.2 Gait-conditioned Policies

Figure 2: Crouch gaits. The knee angles of the normal, 20°, 30°, and 60° crouch gaits at the middle of the right-foot stance phase.

Human locomotion can be characterized by a family of parameters. The flexibility and representation power of deep neural networks allow parametric variations in gaits to be learned in a single network-based policy, which takes those parameters as state input. The state of the gait-conditioned policy $s=[p,\dot{p},\phi,c,l,v]$ includes three gait parameters $c,l,v$, which are the normalized crouch angle, stride length, and walking speed, respectively. In a normal gait, the stance knee is straight at the middle of the stance phase, whereas a crouch gait has its knee flexed throughout the stance phase (see Figure 2). The crouch angle indicates the level of crouching normalized to $[0,1]$. The stride length and walking speed are normalized to zero mean and unit standard deviation.
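The sketch below shows one way the gait parameters could be normalized before being appended to the state; the linear mapping of the crouch angle to [0,1] and the use of the normal-gait statistics of Table 2 are assumptions made for illustration.

import numpy as np

# Normal-gait statistics from Table 2, used here only for illustration.
SPEED_MEAN, SPEED_STD = 0.994, 0.263     # m/s
STRIDE_MEAN, STRIDE_STD = 1.126, 0.180   # m
MAX_CROUCH_DEG = 60.0                    # assumed mapping of crouch angle to [0, 1]

def gait_parameters(crouch_deg, stride_m, speed_mps):
    """Normalized (c, l, v) appended to the policy state s = [p, p_dot, phi, c, l, v]."""
    c = crouch_deg / MAX_CROUCH_DEG                # crouch level in [0, 1]
    l = (stride_m - STRIDE_MEAN) / STRIDE_STD      # zero-mean, unit-variance stride
    v = (speed_mps - SPEED_MEAN) / SPEED_STD       # zero-mean, unit-variance speed
    return np.array([c, l, v])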

Because a gait-conditioned policy takes the gait parameters as state input, learning it requires a collection of example motions that span the target parametric domain. We generate example motions by kinematically varying a single reference motion clip, which represents a normal gait with average walking speed and stride length. Given parameter values $(c,l,v)$, hierarchical displacement mapping and time warping [LS99] edit the reference motion clip to have the desired stride length $l$, crouch angle $c$, and walking speed $v$.
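As a simple illustration of this kinematic editing step, the sketch below resamples a reference clip to a desired walking speed by uniform time warping; the full hierarchical displacement mapping of [LS99] (stride-length and crouch-angle edits) is omitted, and per-DoF linear interpolation is assumed.

import numpy as np

def time_warp(frames, src_speed, dst_speed):
    """Uniformly time-warp a reference clip (frames: N x D array of per-DoF values)
    so that the same stride is traversed at the desired walking speed."""
    ratio = src_speed / dst_speed                   # > 1 slows the clip down
    n_out = max(2, int(round(len(frames) * ratio)))
    src_idx = np.linspace(0.0, len(frames) - 1, n_out)
    lo = np.floor(src_idx).astype(int)
    hi = np.minimum(lo + 1, len(frames) - 1)
    w = (src_idx - lo)[:, None]
    return (1.0 - w) * frames[lo] + w * frames[hi]  # linear blend of adjacent frames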

During episodic learning, the initial state of each episode is taken from the state space containing the target parametric domain. A common way to determine the initial state is to sample uniformly in the state space. This naïve learning of a policy often suffers from a biased exploration problem: many DRL algorithms tend to explore successful regions of the target domain aggressively while leaving less successful regions under-explored. For example, human walking is more stable when the walker crouches to lower its CoM [Sch87]. A control policy parameterized by a crouch angle would therefore explore crouch walking more frequently than normal (straight-knee-at-stance) walking, and consequently the learned normal gait would be less robust.

Recently, Won et al. [WL19] proposed an adaptive sampling method to deal with parametric body shape variations. We adopt their sampling idea to learn our gait-conditioned policies. The key idea is to give more opportunities to less successful regions in the target domain. The measure of success is a marginal value function $V_m(s_\alpha)$ that estimates the sum of expected rewards for each gait parameter $s_\alpha=(c,l,v)$.

$V_m(s_\alpha)=\int_{S_\beta}V(s_\alpha,s_\beta)\,p_s(s_\alpha,s_\beta)\,ds_\beta,$   (3)

where $V(s)=V(s_\alpha,s_\beta)$ is a value function that approximately measures the cumulative reward when the controller follows the current policy from state $s$, $s_\beta$ is the state vector excluding the gait parameters, $S_\beta$ is the domain of $s_\beta$, and $p_s$ is a density function assumed to be constant. The probability of exploring $P(s_\alpha)$ is

$P(s_\alpha)=\frac{1}{Z}\exp\left(-k\left(\frac{V_m(s_\alpha)}{\mu}-1\right)\right).$   (4)

Here, $\mu$ is the expectation of $V_m$, which is updated along with $V_m$, $Z$ is a scaling factor that normalizes $P$, and $k$ decides the degree of uniformity of $P$ over the gait-parameter space. MCMC (Markov Chain Monte Carlo) sampling with this probability aims to make the marginal value function $V_m(s_\alpha)$ near-uniform across the domain of $s_\alpha$. We chose $k=1$ for all controllers.
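The sketch below illustrates the adaptive sampling of Equations (3)-(4); the Monte-Carlo estimate of the marginal value, the random-walk proposal, and its step size are assumptions of this sketch, whereas the exploration weight follows Equation (4) with k = 1. The actual procedure follows Won et al. [WL19].

import numpy as np

K = 1.0   # uniformity factor k used for all controllers

def marginal_value(value_fn, s_alpha, sbeta_samples):
    """Monte-Carlo estimate of Eq. (3): average the critic over non-gait states
    s_beta, treating the density p_s as constant."""
    return np.mean([value_fn(s_alpha, s_beta) for s_beta in sbeta_samples])

def exploration_weight(v_m, mu):
    """Unnormalized exploration probability of Eq. (4); mu is the running mean of V_m."""
    return np.exp(-K * (v_m / mu - 1.0))

def mcmc_step(rng, value_fn, sbeta_samples, s_alpha, mu, step=0.1):
    """One Metropolis-Hastings step over the gait parameters (c, l, v) with a
    random-walk proposal; low-value regions are visited more often, pushing
    V_m toward near-uniformity over the parameter domain."""
    candidate = s_alpha + rng.normal(0.0, step, size=s_alpha.shape)
    w_cur = exploration_weight(marginal_value(value_fn, s_alpha, sbeta_samples), mu)
    w_new = exploration_weight(marginal_value(value_fn, candidate, sbeta_samples), mu)
    if rng.uniform() < min(1.0, w_new / w_cur):
        return candidate
    return s_alpha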

4 Push-Recovery Experiments

The primary goal of this research is to assess the robustness of deep policies in comparison with human subjects and pre-deep-learning controllers. The study of Lee et al. [LLK15] provides a reference for the comparison. We conduct push-recovery experiments with identical simulation setups to generate measurement data that can be directly compared to their results.

Figure 3: The setup for push-recovery experiments from top and lateral views. The blue line indicates the heading direction (Y axis).

The motion capture and measurement data from the previous study are available on their project webpage [LLK15]. The data set includes push-recovery experiments of 29 healthy adults (14 males and 15 females). The participants walked along a straight line with a choice of two speeds (normal/slow), two stride lengths (normal/short), and four levels of crouching (normal/20°/30°/60°). The experimenter pushed the participants sideways to measure the maximal detour distance (see Figure 3). They also recorded a great deal of data, including the height, weight, leg length, and BMI of the participants, the magnitude/timing/direction/duration of pushes, the number of steps to maximal detour, the detour of the first step, and the push force normalized by height/weight/leg length. We refer the readers to the work of Lee et al. [LLK15] for the details of the data acquisition process and specification.

Lee et al. [LLK15] also performed statistical analysis on their data using an LMM (Linear Mixed Model) method and identified four significant factors (level of crouch, walking speed, push timing, and push magnitude) that are correlated with the maximal detour distance. Based on the analysis, they also compared human and simulation experiments to see if simulated controllers are as robust as human balance strategies. Specifically, they adopted the Data-Driven Controller (DDC) of Lee et al. [LKL10], which was designed before the advent of deep learning. Their experiments showed that the response pattern of the controller is qualitatively similar to how humans respond to external pushes, though the controller is not yet as robust as human walking.

In this paper, we conduct two sets of experiments. The first type of experiment faithfully reproduces the experiments of Lee et al. [LLK15] with a new state-of-the-art biped locomotion simulation. Through this experiment, we would like to understand how deep policies compare with human walking and pre-deep-learning simulators. To do so, we generated a family of kinematic walking motions with stride lengths, walking speeds, crouch angles, and push magnitudes/timings that match the distribution of the human data (see Figure 4). The crouch angle is discrete, while all the other parameters are continuous and sampled from normal distributions. The push force in the simulation environment is applied to the shoulder, and its magnitude is also sampled from the normal distribution of the human experiment data. The push timing is sampled between the left heel strike (0%) and the subsequent right heel strike (100%). Ten thousand push experiments were performed for each of the four crouch levels.

The second type of experiments is for assessing the implementation choices in DRL algorithms, such as random perturbation in learning, gait-specific versus gait-conditioned policies, and adaptive sampling. Since only DRL algorithms are compared with each other, we measure the success rate for evaluating the stability of control policies. An episode of simulation is considered successful if the biped withstands a push and keeps walking for 10 seconds afterwards while maintaining its balance.
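The success criterion can be sketched as follows; env and policy are hypothetical interfaces to the simulator and the learned policy, and the phase-based push trigger is an assumption about how the sampled timing is applied.

def run_trial(env, policy, push_force, push_timing, horizon_after_push=10.0):
    """One push-recovery trial: success if the biped withstands the push and keeps
    walking for 10 seconds afterwards while maintaining its balance."""
    state = env.reset()
    pushed_at = None
    while True:
        action = policy.act(state)
        state, fallen, sim_time = env.step(action)
        if pushed_at is None and env.gait_phase() >= push_timing:
            env.apply_push(push_force, duration=0.2)   # lateral push at the shoulder
            pushed_at = sim_time
        if fallen:
            return False                               # fall-over: episode failed
        if pushed_at is not None and sim_time - pushed_at >= horizon_after_push:
            return True                                # kept walking for 10 s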

Figure 4: A family of walking motions with various walking speeds, stride lengths, and crouch angles.

There are several popular methods, such as maximum Lyapunov exponents and maximum Floquet multipliers [DCCS00], for estimating the stability of dynamical systems. These methods quantify how the dynamical systems respond to small perturbations assuming strict periodicity. Small exponents or multipliers indicate that the system would return to a limit cycle. Human locomotion under disturbances is not a simple dynamical system since the control system is continuously modulated to maintain its balance. Its analysis often requires radical approximation of the dynamical system. In computer graphics, push-recovery stability is more popular since resilience against unexpected disturbances is closer to real-life notions of stability. Push-recovery stability quantifies how humans respond to impulsive perturbations. More specifically, maximum detour distances are useful if the experiment involves human participants who can cope with only moderate disturbances. The rate of success would be a better criterion under larger disturbances. There is no ultimate measure of dynamic stability. Assorted measures illuminate different aspects of gait stability. It is possible that a particular gait is very stable in one criterion but not in another.
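For reference, the maximum detour distance can be computed from the ground-projected CoM trajectory as in the following sketch; the array layout is an assumption.

import numpy as np

def max_detour_distance(com_xy, line_point, line_dir):
    """Maximum lateral detour of the ground-projected CoM trajectory (N x 2 array)
    from the walking line defined by a point and a direction."""
    d = np.asarray(line_dir, dtype=float)
    d = d / np.linalg.norm(d)
    rel = np.asarray(com_xy, dtype=float) - np.asarray(line_point, dtype=float)
    signed_dist = rel[:, 1] * d[0] - rel[:, 0] * d[1]   # 2D cross product with d
    return float(np.max(np.abs(signed_dist)))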

5 Analysis and Results

We used an Intel® Core™ i9-9900K CPU @ 3.60GHz (8 cores) for training. Training takes about 24 hours for gait-specific policies, and 40 to 50 hours for gait-conditioned policies. The neural network for all DRL controllers consists of three fully connected layers with 256 nodes each. The network is updated whenever 8192 experience tuples are collected, with a batch size of 256. We used R (version 3.6.1) for the statistical analysis.
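A sketch of a policy network with this architecture is shown below (PyTorch); the ReLU activation and the Gaussian output parameterization are assumptions, as only the layer count and width are specified above.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Gaussian policy with three 256-unit hidden layers. The activation and the
    state-independent log-std are assumptions of this sketch."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.mean = nn.Linear(256, action_dim)                # PD target offsets
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # exploration noise

    def forward(self, state):
        h = self.backbone(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())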

In our experiments, we compared the push-recovery capabilities of human participants (Human), a data-driven controller (DDC) with hand-crafted feedback laws [LKL10], DRL gait-specific policies (DRL-specific), and DRL gait-conditioned policies with two continuous parameters (walking speed and stride length) and one discrete parameter (crouch angle). We denote gait-conditioned policies (see Table 1) learned with and without adaptive sampling by DRL-A and DRL-B, respectively. DRL-specific*, DRL-A*, and DRL-B* are policies learned with unexpected disturbances in the learning phase.

Table 1: Naming of the gait-conditioned control policies. An asterisk (*) means that the policy is trained with random pushes; -A means that the adaptive sampling method is applied during training.

                      With push    Without push
Adaptive sampling     DRL-A*       DRL-A
Uniform sampling      DRL-B*       DRL-B
Table 2: The means and standard deviations of gait factors in the human experiments. The push magnitude is normalized by body weight.

Experimental factor                 Mean     Std
Walking speed (m/s)   Normal        0.994    0.263
                      Crouch 20°    0.808    0.210
                      Crouch 30°    0.788    0.221
                      Crouch 60°    0.744    0.228
Stride length (m)     Normal        1.126    0.180
                      Crouch 20°    0.953    0.158
                      Crouch 30°    0.916    0.167
                      Crouch 60°    0.876    0.168
Push magnitude (N·s/kg)             0.535    0.096
Push timing (%)                     34.0     21.0

The gait parameter domains for learning gait-conditioned policies are determined from the human data (see Table 2). More precisely, at the beginning of every episode in the learning process, gait parameters are sampled uniformly (for DRL-B and DRL-B*) or adaptively (for DRL-A and DRL-A*) within a certain Mahalanobis distance $t$. We chose $t$ so that the sampling region is the 95% confidence region. Note that the walking speed and stride length are correlated, and the correlation coefficient for each crouch level ranges from 0.67 to 0.83. For training the control policies with external pushes (DRL-specific*, DRL-A*, and DRL-B*), the simulated character is pushed for 0.2 seconds from the left or the right at random. The push direction always matches the stance leg: a push from the left occurs when the left leg is in stance and vice versa. The magnitude and timing of pushes are sampled to match the distributions in the human experiment data (Table 2). The push magnitudes were mostly in the range of 100 N to 300 N.
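The uniform sampling within the 95% confidence region can be sketched as follows; the correlation value of 0.75 and the rejection-sampling box are illustrative assumptions (the reported correlation range is 0.67-0.83).

import numpy as np
from scipy.stats import chi2

def sample_gait_uniform(rng, mean, cov, confidence=0.95):
    """Uniform (DRL-B-style) sampling of (stride length, walking speed) inside the
    confidence ellipse, i.e., within Mahalanobis distance t with t^2 = chi2.ppf(confidence, 2)."""
    t2 = chi2.ppf(confidence, df=2)
    cov_inv = np.linalg.inv(cov)
    std = np.sqrt(np.diag(cov))
    while True:                                   # rejection sampling over a box
        x = rng.uniform(mean - 3 * std, mean + 3 * std)
        if (x - mean) @ cov_inv @ (x - mean) <= t2:
            return x

# Illustrative numbers for the normal gait (Table 2), with an assumed correlation of 0.75.
mean = np.array([1.126, 0.994])                   # stride (m), speed (m/s)
std = np.array([0.180, 0.263])
corr = 0.75
cov = np.array([[std[0] ** 2, corr * std[0] * std[1]],
                [corr * std[0] * std[1], std[1] ** 2]])
# sample = sample_gait_uniform(np.random.default_rng(0), mean, cov)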

5.1 Human vs Simulation

The statistics of gait factors in the human experiments are shown in Table 2. The trials in the human data are classified into two groups: Group 1 and Group 2. The participants in Group 1 trials recovered their balance in a single step, and thus the detour distance peaked at that step. Most of the participants recovered their balance within three balance-correcting steps, except for a few outliers; they took more than one step when they experienced mild difficulty. In the outlier trials, participants were panicked by unexpected pushes and failed to return to the line. Group 2 includes all trials except for the outliers.

Figure 5: Box plots of maximum detour distance in Group 1 (left) and Group 2 (right). The horizontal line in the middle of each box marks the median, and the colored area covers the 25th to 75th percentiles of the data.

In Figure 5, we compare the maximum detour distances of Human, DDC, DRL-B*, and DRL-A*. The results of the Human and DDC experiments were taken from the work of Lee et al. [LLK15]. The simulation experiments were designed to reproduce the human experiments with setups and data distributions carefully tuned to match the target experiments. The only difference is that we can collect far more trial data by simulation. Group 1 of Human, DDC, DRL-B*, and DRL-A* includes 228, 3858, 4851, and 16036 trials, respectively, and Group 2 includes 450, 13707, 20534, and 27993 trials, respectively. The comparison shows that DRL-B* and DRL-A* are clearly superior to DDC and comparable to the human participants.

Figure 6: The success rates (%) of gait-conditioned policies.

It was empirically verified in the previous study that crouch walking is more stable than normal walking in the Human and DDC experiments. In particular, 30°-crouch walking was the most stable. This postulation agrees with our intuition that lowering the CoM improves gait stability. Although the detour distance measurements of DRL-B* and DRL-A* do not follow this trend, the success rate experiments agree with the postulation. Figure 6 shows that both DRL-B* and DRL-A* are more robust with 20°/30°-crouch walking than with normal and 60°-crouch walking. 30°-crouch walking was the most robust in terms of the success rate.

Table 3: Type 3 tests of the fixed effects (level of crouch, walking speed, magnitude/timing of push) on the detour distance. Significant values (p < 0.05) are shown in bold.

Type 3 tests for fixed effects
                    Level of Crouch     Push Magnitude     Walking Speed      Push Timing
                    F_c      p_c        F_f      p_f       F_s      p_s       F_t      p_t
Group 1   Human     17.49    <.0001     13.42    0.0003    2.68     0.1098    14.35    <.0001
          DDC       30.06    <.0001     17546    <.0001    106.6    <.0001    463.4    <.0001
          DRL-A*    196.5    <.0001     11791    <.0001    4536     <.0001    363.5    <.0001
          DRL-B*    768.7    <.0001     3022     <.0001    625.6    <.0001    130.0    <.0001
Group 2   Human     8.35     <.0001     0.01     0.9297    0.03     0.8578    3.94     0.0479
          DDC       88.34    <.0001     19103    <.0001    371      <.0001    225.8    <.0001
          DRL-A*    7.857    0.0051     12427    <.0001    251.8    <.0001    2329     <.0001
          DRL-B*    575.6    <.0001     10937    <.0001    57.53    <.0001    331.6    <.0001

Table 3 shows the results of the Type 3 tests, which assess the fixed effects of the level of crouch, walking speed, and push magnitude/timing on the detour distance. The tests confirm that all factors that were proven to be significant in the human experiments are also significant (p < 0.05) in the DRL experiments.
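An analogous mixed-model fit can be sketched in Python with statsmodels (the Type 3 F-tests above were computed in R); the data file and column names are hypothetical placeholders, and the sketch reports Wald tests of the fixed effects rather than Type 3 F-tests.

import pandas as pd
import statsmodels.formula.api as smf

# "trials.csv" is a hypothetical file with one row per push-recovery trial; the
# column names are placeholders for the measured factors and the response.
trials = pd.read_csv("trials.csv")

# Linear mixed model: four fixed effects and a per-subject random intercept.
model = smf.mixedlm("detour ~ C(crouch) + speed + magnitude + timing",
                    trials, groups=trials["subject"])
result = model.fit()
print(result.summary())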

Figure 7: The success rates of gait-specific and gait-conditioned policies for 30°-crouch walking. Mean parameter values are shown in bold. We did not plot failure cases where a control policy was unable to learn the walking motion for the corresponding walking speed and stride length.

5.2 Comparison of Control Policies

Gait-specific policies (DRL-specific*) supposedly outperform gait-conditioned policies (DRL-A* and DRL-B*), since the scope of a gait-specific policy focuses on a single reference trajectory, while the gait-conditioned policies have to cope with a continuous spectrum of the parametric domain. Specifically, the domain $[\mu_s-1.5\sigma_s,\mu_s+2\sigma_s]\times[\mu_l-1.5\sigma_l,\mu_l+2\sigma_l]$ is explored in our experiments, where $\mu_s$ and $\mu_l$ are the average walking speed and stride length, respectively, in the human experiments, and $\sigma_s$ and $\sigma_l$ are their standard deviations. The use of an asymmetric domain is related to the perceptual plausibility of motion editing: edited character animations that walk slower than a reference are more likely to look unnatural than animations that walk faster than the reference [VHBO14].

Figure 8: The parameter coverage of successful episodes. Each ellipse represents a 98% confidence region.

We conducted push-recovery experiments of 1000 trials for each of the three policies. Push forces were drawn from a normal distribution with a mean of 200 N and a standard deviation of 35 N (see Figure 7). A family of motion clips was generated to learn the DRL-specific* policies. The stability of each gait-specific policy learned for a particular walking speed and stride length is comparable to the stability of DRL-B* near the mean parameter values, but clearly superior when the samples are away from the means. This means that naïve parametric learning can deal with only a narrow range of the parametric domain. The adaptive sampling of DRL-A* improves the stability at the corners of the domain and consequently learns a policy that is practically usable over the entire domain within a modest computational time. Figure 8 depicts the parametric coverage of successful episodes. The coverage of DRL-A* is wider than the coverage of DRL-B*, meaning that DRL-A* can better deal with slow walking (narrow strides) and fast walking (wide strides), while DRL-B* is effective only in the mid ranges.

Deciding the number of hidden layers of the policy network depends on the dimension and size of the action, state, and parameter spaces. In practice, the number of layers and the number of nodes in each layer are important hyper-parameters to tune empirically. In our experiments, we tested with two, three, and four hidden layers and found that learning was the most successful with three layers.

5.3 Foot Placement Analysis

Figure 9 panels: (a) Human data (Normal), (b) DRL-A* (Normal), (c) DRL-A (Normal), (d) Human data (30°-crouch), (e) DRL-A* (30°-crouch), (f) DRL-A (30°-crouch).
Figure 9: Foot placement plots. All foot placements are overlaid in the same coordinate system. Green diamonds are the footprints to step on in the absence of external disturbances. Pre-push stance footprints are indicated by magenta dots or magenta diamonds. The blue and green dots represent the first post-push footprints of Group 1 and Group 2 − Group 1 in successful episodes, respectively. The red dots are those of unsuccessful balance recovery episodes. The x-axis and y-axis are the lateral and moving directions. The footprints are classified into intervals (0%, 20%, 40%, 60%, 80%, 100%) of push timing indicated by black dotted curves. Note that the origin of the human-data coordinates is the mean of the stance foot positions.

It has been previously postulated that balancing strategies based on foot placement play a dominant role in dynamic walking. We hypothesized that deep policies learn to smartly choose post-push footholds to recover balance within a few steps. We conducted push-recovery experiments with four deep policies: DRL-A, DRL-B, DRL-A*, and DRL-B*. In each trial, the simulated biped was pushed from the left while its left foot was in stance and took the next step on the right to recover its balance. Figure 9 shows that the stability of the control policies is characterized by the pattern of post-push foot placements, which are plotted with respect to the reference footprint to step on if there were no disturbances. The push magnitude and timing are randomly sampled, while the walking speed, stride length, and crouch angle are fixed. Push magnitudes are sampled from a normal distribution with a mean of 200 N and a standard deviation of 35 N. Push timings are sampled from another normal distribution with a mean of 34% and a standard deviation of 21%.

We used 158 normal-gait trials and 22 30°-crouch trials from the human data. Note that the human data were collected from many participants under varied conditions and are thus necessarily noisy, while the simulation data were collected in a controlled simulation setup. Therefore, a direct side-by-side comparison is not meaningful, and the data should be interpreted qualitatively. In the human data (Figure 9(a), 9(d)), the disturbed swing foot tends to land behind (in the y-axis) the undisturbed step position, meaning that human participants tend to slow down temporarily to cope with lateral disturbances. The success rate of DRL-A is very low (3.0% for normal walking and 10.1% for 30°-crouch walking), and the post-push footholds of its successful trials are all at nearly the same y-coordinate as the undisturbed reference step position, since DRL-A did not learn how to cope with disturbances. Unlike DRL-A, DRL-A* exploited uncertainty in the learning phase and learned to adjust its step length (shorter in the y-axis) similarly to human balance strategies.

The foot placement plots in Figure 9 also show that post-push step positions are closely correlated with push timing. The push may occur in either the double-stance phase [0%, 20%] or the swing phase [20%, 100%]. A push in the early swing phase allows the walker sufficient time to move its leg to an appropriate position for balance recovery, while a push in the late swing phase forces the walker to put the swing foot down rapidly with little time for control. Therefore, control policies are more resilient to early-swing pushes than to late-swing pushes. Note that maximum detour distances are shorter (which can be interpreted as more resilient) with late-swing pushes, which seems contradictory to the proposition based on the success rate. As we discussed before, there is no ultimate criterion for resilience and gait stability; the notion of stability is characterized by a mixture of measures. For example, Figure 9 shows that the post-push step positions tend to be farther from the undisturbed reference step position the earlier the push occurs (the 20% curve is farther than the 60% curve from the reference position), but the success rate for early pushes is higher than for late pushes (the ratio of blue and green dots in the <20% push-timing region is higher than that in the 40%–60% region).

As shown in Figure 9, the successful post-push steps of normal gaits mostly land at smaller y-values than the reference stepping position, as in the human data. This is not necessarily the case for 30°-crouch gaits, although the majority of the successful post-push steps still have smaller y-values than the reference position. This means that perturbations during the learning process lead to policies that more closely resemble the actual human balancing mechanism. The significant fatigue experienced by the human participants during crouch walking is not modeled in our simulation, which might explain the slightly different tendency of post-push steps for 30°-crouch gaits.

The plots in Figure 10 depict the correlation between post-push foot placements and push magnitude/timing. DRL-A* learned balance strategies based on foot placements, which are strongly correlated with how strong the disturbance is and when it occurs. DRL-A* is more robust than DRL-A at all push magnitudes and timings. DRL-A* adjusts its post-push foothold not only in the push (lateral) direction but also in the moving direction to better respond to the pushes. We also observed a clear correlation between the number of detour steps and push magnitude: as the push magnitude increases, the character needs from one step (Group 1, marked in blue) up to three steps (Group 2 − Group 1, marked in green) to recover balance after the push.

Figure 10 panels: (a) DRL-A (Pushing direction), (b) DRL-A (Walking direction), (c) DRL-A* (Pushing direction), (d) DRL-A* (Walking direction).
Figure 10: Foot placement plots at four push timings (13%, 34%, 55%, 67%). The green diamond indicates the undisturbed reference step position.

5.4 Application: Interactive Control

In an interactive control problem, it is critical to process user input on the fly. A sudden change of the reference motion during the simulation can act as an impulse on the system. Since the gait-conditioned policies learn from diverse walking motions under disturbances, a simulated character controlled by these policies is robust to changes of the input motion and can thus transition from one gait pattern to another smoothly without any additional technique such as motion blending.

We demonstrate interactive control to show the robustness of our gait-conditioned policies (Figure 11). In this example, we compare the four deep control policies: DRL-A*, DRL-B*, DRL-A, and DRL-B. Each character is controlled by a single control policy, and the reference motion is changed whenever user input consisting of walking speed, stride length, and crouch angle is given. Despite diverse types of sabotage with obstacles, push forces, and repeated changes of the input motion, the character controlled by DRL-A* is the only one to survive to the end. Please refer to the supplemental video (at 05:24).
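The interactive loop can be sketched as follows; env, policy, and get_user_input are hypothetical interfaces, and the reference clip is assumed to be re-warped whenever new gait parameters arrive.

def interactive_loop(env, policy, get_user_input):
    """Interactive control sketch: the reference motion is changed on the fly
    whenever the user supplies new (crouch, stride, speed) values."""
    gait = (0.0, 0.0, 0.0)                    # normalized (c, l, v)
    state = env.reset(gait)
    while env.running():
        new_gait = get_user_input()           # None if no new input this frame
        if new_gait is not None:
            gait = new_gait
            env.set_reference(gait)           # kinematically edit the reference clip
        action = policy.act(state, gait)      # gait-conditioned policy input
        state = env.step(action)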

Figure 11: Interactive control. Using a single gait-conditioned policy, we can simulate a sequence of biped walking motions that spans a wide range of walking speeds, stride lengths, and crouch angles while withstanding external pushes.

6 Discussion

Biped locomotion control has been a long-standing problem in computer graphics. Tremendous efforts have been put into addressing this problem for decades, and the computer graphics community has finally reached the point where simulated controllers are worth comparing with the performance and capability of humans in terms of robustness, energy efficiency, agility, diversity, and flexibility. This study would be a stepping stone towards rigorous evaluation of DRL policies for biped locomotion.

Although our experiments demonstrated the push-recovery stability of DRL policies under varied conditions, there are still numerous gait factors, body/environment conditions, types of disturbances, technical issues, and limitations to be addressed in future studies. Currently, our study evaluates only the effects of lateral disturbances. We can think of many other conditions, including disturbances at arbitrary locations and in arbitrary directions, and responses to slipping, tripping, slopes, and uneven terrain [PBYVDP17, WFH10].

Physiological factors such as effort and fatigue are not considered yet. The dynamics model of our bipeds does not have explicit torque limits. Without torque limits, the biped can withstand arbitrarily strong disturbances with unrealistically fast and strong responses. Fortunately, such an unnatural situation does not occur in our experiments, probably because joint torques are implicitly limited in the reference-tracking framework. It is strenuous for human participants to walk in a 60° crouch, and 60°-crouch walking is less stable than 30°-crouch walking probably due to fatigue, which is not implemented in our DRL algorithms; the simulated bipeds never get tired in strenuous tasks. Incorporating effort and fatigue into a dynamical system requires the concept of energy/torque minimization, which plays a central role in trajectory optimization algorithms [ABDLH12, LPKL14]. However, it is still unclear how the concept can be implemented in the DRL framework. Learning energy-efficient, compliant policies would be an interesting direction for future studies.

In our work, we parameterize the reference motion clip to simulate various walking motions with different crouch angles, walking speeds, and stride lengths. The motion is warped according to the selected gait parameters and then used to train a control policy. The problem is that, unlike the unwarped reference motion, which reflects the dynamics of the character, warped motions might not be compatible with the model's dynamics. Nevertheless, our gait-conditioned policy can mimic various motions warped by gait parameters. This is possible because there is a common control strategy required to mimic walking motions, and this common strategy can be learned through DRL. It can be learned more easily from motions close to the model's dynamics and serves as a guide for heavily warped motions far from those dynamics.

Lessons can be drawn from our experiments for the evaluation and implementation of DRL algorithms. First, gait and stability are strongly correlated: crouching and stomping gaits that might look unnatural or impaired can be more robust than typical, normal gaits. Therefore, evaluating the robustness and stability of a controller entails gait normalization for a fair comparison. Secondly, including uncertainty and disturbances in the learning phase is imperative for learning robust control policies that mimic human balance strategies. Thirdly, gait-conditioned policies are very useful in terms of computation time and memory usage, and the use of adaptive sampling is strongly encouraged to improve the robustness of control policies uniformly across the parametric domain.

There are many promising applications we can think of. Interactive graphics applications, such as video games, often show animated controllable characters that hit, react, and fall. Understanding their resilience against disturbances would be useful for better simulation of their interactions. We can also think of an exoskeleton-type walking assist device that generates assist force/torque at the joint of the wearer. Many devices are designed to help the elderly and the handicapped who are exposed to the risk of falls. Walking assist devices equipped with the ability to prevent falls would be highly desirable [CLDK07, Low11]. We believe that our analysis in this study will provide a solid basis for designing such devices.

References

  • [ABDLH12] Al Borno M., De Lasa M., Hertzmann A.: Trajectory optimization for full-body movements with complex contacts. IEEE transactions on visualization and computer graphics 19, 8 (2012).
  • [BCHF19] Bergamin K., Clavet S., Holden D., Forbes J. R.: Drecon: data-driven responsive control of physics-based characters. ACM Transactions on Graphics 38, 6 (2019).
  • [BWSC01] Brauer S. G., Woollacott M., Shumway-Cook A.: The interacting effects of cognitive demand and recovery of postural stability in balance-impaired elderly persons. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences 56, 8 (2001).
  • [CBvdP10] Coros S., Beaudoin P., van de Panne M.: Generalized biped walking control. ACM Transactions on Graphics 29, 4 (2010).
  • [CLDK07] Constantinescu R., Leonard C., Deeley C., Kurlan R.: Assistive devices for gait in parkinson’s disease. Parkinsonism & related disorders 13, 3 (2007).
  • [DCCS00] Dingwell J. B., Cusumano J. P., Cavanagh P., Sternad D.: Local dynamic stability versus kinematic variability of continuous overground and treadmill walking. J. Biomech. Eng. 123, 1 (2000).
  • [dLMH10] de Lasa M., Mordatch I., Hertzmann A.: Feature-based locomotion controllers. ACM Transactions on Graphics 29, 4 (2010).
  • [DSAP08] Da Silva M., Abe Y., Popović J.: Simulation of human motion data using short-horizon model-predictive control. In Computer Graphics Forum (2008), vol. 27.
  • [HHC19] Hong S., Han D., Cho K., Shin J. S., Noh J.: Physics-based full-body soccer motion control for dribbling and shooting. ACM Transactions on Graphics 38, 4 (2019).
  • [HTS17] Heess N., TB D., Sriram S., Lemmon J., Merel J., Wayne G., Tassa Y., Erez T., Wang Z., Eslami S. M. A., Riedmiller M., Silver D.: Emergence of Locomotion Behaviours in Rich Environments. arXiv preprint arXiv:1707.02286 (2017).
  • [HWBO95] Hodgins J. K., Wooten W. L., Brogan D. C., O’Brien J. F.: Animating human athletics. In ACM SIGGRAPH (1995).
  • [KH10] Kwon T., Hodgins J.: Control systems for human running using an inverted pendulum model and a reference motion capture sequence. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2010), Eurographics Association.
  • [KH17] Kwon T., Hodgins J. K.: Momentum-Mapped Inverted Pendulum Models for Controlling Dynamic Human Motions. ACM Transactions on Graphics 36, 1 (2017).
  • [LH17] Liu L., Hodgins J.: Learning to Schedule Control Fragments for Physics-Based Characters Using Deep Q-Learning. ACM Transactions on Graphics 36, 3 (2017).
  • [LH18] Liu L., Hodgins J.: Learning basketball dribbling skills using trajectory optimization and deep reinforcement learning. ACM Transactions on Graphics 37, 4 (2018).
  • [LKL10] Lee Y., Kim S., Lee J.: Data-driven biped control. ACM Transactions on Graphics 29, 4 (2010).
  • [LLK15] Lee Y., Lee K., Kwon S.-S., Jeong J., O’Sullivan C., Park M. S., Lee J.: Push-recovery stability of biped locomotion. ACM Transactions on Graphics 34, 6 (2015).
  • [Low11] Low K.: Robot-assisted gait rehabilitation: From exoskeletons to gait systems. In 2011 Defense Science Research Conference and Expo (DSR) (2011), IEEE.
  • [LPKL14] Lee Y., Park M. S., Kwon T., Lee J.: Locomotion control for many-muscle humanoids. ACM Transactions on Graphics 33, 6 (2014).
  • [LPLL19] Lee S., Park M., Lee K., Lee J.: Scalable muscle-actuated human simulation and control. ACM Transactions on Graphics 38, 4 (2019).
  • [LS99] Lee J., Shin S. Y.: A hierarchical approach to interactive motion editing for human-like figures. In Siggraph (1999), vol. 99, ACM Press/Addison-Wesley Publishing Co.
  • [LvdPY16] Liu L., van de Panne M., Yin K.: Guided learning of control graphs for physics-based characters. ACM Transactions on Graphics 35, 3 (2016).
  • [LYvdPG12] Liu L., Yin K., van de Panne M., Guo B.: Terrain runner: control, parameterization, composition, and planning for highly dynamic motions. ACM Transactions on Graphics 31, 6 (2012).
  • [MdLH10] Mordatch I., de Lasa M., Hertzmann A.: Robust physics-based locomotion using low-dimensional planning. ACM Transactions on Graphics 29, 4 (2010).
  • [MMK12] Millard M., McPhee J., Kubica E.: Foot Placement and Balance in 3D. Journal of Computational and Nonlinear Dynamics 7, 2 (2012).
  • [PALvdP18] Peng X. B., Abbeel P., Levine S., van de Panne M.: Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics 37, 4 (2018).
  • [PBYVDP17] Peng X. B., Berseth G., Yin K., Van De Panne M.: DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning. ACM Transactions on Graphics 36, 4 (2017).
  • [PM18] Pan Z., Manocha D.: Active animations of reduced deformable models with environment interactions. ACM Transactions on Graphics 37, 3 (2018).
  • [PRL19] Park S., Ryu H., Lee S., Lee S., Lee J.: Learning predict-and-simulate policies from unorganized human motion data. ACM Transactions on Graphics 38, 6 (2019).
  • [RHJ01] Rogers M. W., Hedman L. D., Johnson M. E., Cain T. D., Hanke T. A.: Lateral stability during forward-induced stepping for dynamic balance recovery in young and older adults. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences 56, 9 (2001).
  • [Sch87] Schafer R. C.: Clinical Biomechanics: Musculoskeletal Actions and Reactions. Williams & Wilkins, 1987.
  • [SKL07] Sok K. W., Kim M., Lee J.: Simulating biped behaviors from human motion data. ACM Transactions on Graphics 26, 3 (2007).
  • [SML15] Schulman J., Moritz P., Levine S., Jordan M., Abbeel P.: High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015).
  • [SWD17] Schulman J., Wolski F., Dhariwal P., Radford A., Klimov O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • [TLC09] Tsai Y.-Y., Lin W.-C., Cheng K. B., Lee J., Lee T.-Y.: Real-time physics-based 3d biped character animation using an inverted pendulum model. IEEE transactions on visualization and computer graphics 16, 2 (2009).
  • [VHBO14] Vicovaro M., Hoyet L., Burigana L., O’Sullivan C.: Perceptual evaluation of motion editing for realistic throwing animations. ACM Transactions on Applied Perception 11, 2 (2014).
  • [WFH10] Wang J. M., Fleet D. J., Hertzmann A.: Optimizing walking controllers for uncertain inputs and environments. ACM Transactions on Graphics 29, 4 (2010).
  • [WHDK12] Wang J. M., Hamner S. R., Delp S. L., Koltun V.: Optimizing locomotion controllers using biologically-based actuators and objectives. ACM Transactions on Graphics 31, 4 (2012).
  • [WKW07] Wight D. L., Kubica E. G., Wang D. W. L.: Introduction of the Foot Placement Estimator: A Dynamic Measure of Balance for Bipedal Robotics. Journal of Computational and Nonlinear Dynamics 3, 1 (2007).
  • [WL19] Won J., Lee J.: Learning body shape variation in physics-based characters. ACM Transactions on Graphics 38, 6 (2019).
  • [YL10] Ye Y., Liu C. K.: Optimal feedback control for character animation using an abstract model. ACM Transactions on Graphics 29, 4 (2010).
  • [YLvdP07] Yin K., Loken K., van de Panne M.: Simbicon: Simple biped locomotion control. ACM Transactions on Graphics 26, 3 (2007).
  • [YTL18] Yu W., Turk G., Liu C. K.: Learning symmetric and low-energy locomotion. ACM Transactions on Graphics 37, 4 (2018).