
Constrained Reinforcement Learning for Dexterous Manipulation

Abhineet Jain, Jack Kolb, Harish Ravichandar
Georgia Institute of Technology
{abhineetjain, kolb}@gatech.edu, [email protected]
Abstract

Existing learning approaches to dexterous manipulation use demonstrations or interactions with the environment to train black-box neural networks that provide little control over how the robot learns the skills or how it will perform after training. These approaches pose significant challenges when implemented on physical platforms given that, during the initial stages of training, the robot’s behavior could be erratic and potentially harmful to its own hardware, the environment, or any humans in the vicinity. A potential way to address these limitations is to add constraints that restrict and guide the robot’s behavior during training as well as during roll outs. Inspired by the success of constrained approaches in other domains, we investigate the effects of adding position-based constraints to a 24-DOF robot hand learning to perform object relocation using Constrained Policy Optimization. We find that a simple geometric constraint can ensure the robot learns to move towards the object sooner than without constraints. Further, training with this constraint requires a similar number of samples as its unconstrained counterpart to master the skill. These findings shed light on how simple constraints can help robots achieve sensible and safe behavior quickly and ease concerns surrounding hardware deployment. We also investigate the effects of the strictness of these constraints and report findings that provide insights into how different degrees of strictness affect learning outcomes. Our code is available at https://github.com/GT-STAR-Lab/constrained-rl-dexterous-manipulation.

Figure 1: The relocation task in MuJoCo. This task requires the robot hand to pick up the blue ball from the tabletop and carry it to the green goal region.

1 Introduction

Dexterous manipulation often involves the use of high degree-of-freedom robots to manipulate objects. Representative dexterous manipulation tasks include relocating objects, picking up arbitrarily shaped objects, and sequential interactions with articulated objects (e.g., unlatching and opening a door). Indeed, factors such as high-dimensional state spaces and complex interaction dynamics make these tasks challenging to automate. Classical control methods are difficult to apply to dexterous manipulation due to the manual effort required to design controllers in high-dimensional spaces.

Prior work in dexterous manipulation has succeeded by using self-supervised methods in simulation and transferring the learned policies to real robots Akkaya et al. (2019). Others have utilized demonstrations to improve reinforcement learning Rajeswaran et al. (2017). However, these approaches are hard to train on real robots, as initial robot behavior can be erratic and unsafe.

In this work, we explore adding instance-specific constraints to an object relocation task (Fig. 1) that restrict and guide the robot’s behavior during training as well as during roll outs. Constrained Policy Optimization (CPO) Achiam et al. (2017), built upon Trust Region Policy Optimization (TRPO) Schulman et al. (2015), is an effective method for solving constrained MDPs. We formulate a cylindrical boundary constraint for the initial motion of the robot hand towards the object (Fig. 2); the robot incurs a penalty when it moves outside the boundary. We find that using CPO with this simple geometric constraint can ensure the robot learns to move towards the object sooner than without constraints. Further, training with this constraint (CPO) requires a similar number of samples as its unconstrained counterpart (TRPO) to master the skill. These findings shed light on how simple constraints can help robots achieve sensible and safe behavior quickly and ease concerns surrounding hardware deployment. We also investigate the effects of the strictness of these constraints and report findings that provide insights into how different degrees of strictness affect learning outcomes.

Figure 2: The boundary constraint defined in Eq. 3 for the initial motion of the robot hand towards the ball. The robot incurs a penalty when it moves outside the boundary.
Figure 3: Comparing CPO with TRPO and TRPO-RP. Bottom: CPO incurs a lower average number of violations than both TRPO policies for all three intensities of the boundary constraint, and learns to satisfy the constraints better throughout the training process. Top: CPO has lower sample efficiency for all three constraint cases, a small trade-off to ensure safe learning.

2 Related Work

Previous works explore self-supervised methods to manipulate objects by adding different types of constraints. To gently lift objects, tactile sensors have been used to constrain contact forces on a 24-DOF robot hand Huang et al. (2019); however, this approach does not consider task performance. Another work trains dynamic movement primitives (DMPs) for a 10-DOF robot hand considering virtual joint constraints or friction Li et al. (2017), but provides no safety guarantees beyond DMPs being deterministic. In low-dimensional environments, one work investigates stable robot path trajectories using graph optimization for motion planning, but does not focus on performance Englert and Toussaint (2016). Another work focuses on in-hand object manipulation in a low-dimensional environment by adding constraints between the robot and object Kobayashi et al. (2005). In multi-agent settings, boundaries based on robot geometry have been used to provide safety guarantees for collaboration in close quarters Gu et al. (2017).

3 Background

3.1 Trust Region Policy Optimization (TRPO)

TRPO is a policy gradient method for solving Markov Decision Processes that avoids parameter updates that change the policy too much by placing a KL-divergence constraint on the size of the policy update at each iteration Schulman et al. (2015).

We train TRPO on-policy, i.e., the policy used for collecting data is the same as the policy we want to optimize. The objective function $J(\theta)$ measures the total advantage $\hat{A}_{\theta_{old}}$ over the state visitation distribution $p^{\pi_{\theta_{old}}}$ and actions from $\pi_{\theta_{old}}$, while the mismatch between the training data distribution and the true policy state distribution is compensated with an importance sampling estimator. TRPO maximizes this objective subject to a trust-region constraint, which enforces that the KL-divergence between the old and new policies stays within a parameter $\delta$. This is summarized in Eq. 1.

$$\begin{aligned}
\text{maximize}\quad & J(\theta)=\mathbb{E}_{s\sim p^{\pi_{\theta_{old}}},\,a\sim\pi_{\theta_{old}}}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}\hat{A}_{\theta_{old}}(s,a)\right)\\
\text{subject to}\quad & \mathbb{E}_{s\sim p^{\pi_{\theta_{old}}}}\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot|s)\parallel\pi_{\theta}(\cdot|s)\big)\right]\leq\delta
\end{aligned}\tag{1}$$
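To make the update concrete, the following is a minimal sketch of how the surrogate objective and the trust-region check in Eq. 1 can be estimated from a batch of samples. It assumes PyTorch; the names (log_probs_new, log_probs_old, advantages, kl_divergence, delta) are illustrative, and the actual TRPO update additionally involves a conjugate-gradient step and a backtracking line search.

```python
import torch

def trpo_surrogate_and_kl(log_probs_new, log_probs_old, advantages,
                          kl_divergence, delta=0.01):
    """Estimate the importance-sampled surrogate objective of Eq. 1 and
    check the trust-region constraint E[KL(pi_old || pi_new)] <= delta."""
    ratio = torch.exp(log_probs_new - log_probs_old)     # pi_theta(a|s) / pi_theta_old(a|s)
    surrogate = (ratio * advantages).mean()              # Monte-Carlo estimate of J(theta)
    within_trust_region = kl_divergence.mean() <= delta  # trust-region feasibility
    return surrogate, within_trust_region
```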

3.2 Constrained Policy Optimization (CPO)

CPO Achiam et al. (2017) builds on the TRPO algorithm to solve Constrained Markov Decision Processes (CMDPs), which add a cost function $C$ and a cost discount factor $\gamma_c$ to the standard MDP learning problem $(S, A, T, R, p_0, \gamma)$. In a local policy search for CMDPs, in addition to the TRPO optimization, we require policy updates to remain feasible for the CMDP. Our objective function thus adds another condition that limits the expected discounted cost to a cost limit $cl_i$ for each constraint $i$. This is summarized in Eq. 2.

$$\begin{aligned}
\text{maximize}\quad & J(\theta)=\mathbb{E}_{s\sim p^{\pi_{\theta_{old}}},\,a\sim\pi_{\theta_{old}}}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}\hat{A}_{\theta_{old}}(s,a)\right)\\
\text{subject to}\quad & \mathbb{E}_{s\sim p^{\pi_{\theta_{old}}}}\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot|s)\parallel\pi_{\theta}(\cdot|s)\big)\right]\leq\delta\\
\text{and}\quad & \mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma_{c}^{t}\,C_{i}(s_{t},a_{t},s_{t+1})\right]\leq cl_{i}\quad\forall i
\end{aligned}\tag{2}$$
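For intuition, here is a minimal sketch of how the additional cost condition in Eq. 2 can be estimated from sampled trajectories and compared against the cost limit; the function names and default values are illustrative, and the full CPO update further projects the policy step so that this condition remains satisfied.

```python
import numpy as np

def discounted_cost_return(costs, gamma_c=0.995):
    """Discounted sum of per-step constraint costs for one trajectory."""
    return sum(gamma_c ** t * c for t, c in enumerate(costs))

def constraint_satisfied(trajectory_costs, cost_limit=0.25, gamma_c=0.995):
    """Estimate E_tau[sum_t gamma_c^t C(s_t, a_t, s_t+1)] over sampled
    trajectories and compare it to the cost limit cl."""
    expected_cost = np.mean([discounted_cost_return(c, gamma_c)
                             for c in trajectory_costs])
    return expected_cost <= cost_limit
```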

3.3 Problem Formulation

We consider an object relocation task where the agent, a 24-DOF Adroit hand, learns a policy to grasp and relocate a blue ball from a tabletop to a green goal region (Fig. 1). We formulate this task as a Constrained MDP, $(S, A, T, R, p_0, \gamma, C, \gamma_c)$, where $S$ is the state space, $A$ is the action space, $T$ is the transition function, $R$ is the reward function, $p_0$ is the initial state distribution, $\gamma$ is the discount factor, $C$ is the cost function, and $\gamma_c$ is the cost discount factor.

In a typical episode, at each time step $t$, the agent receives an observation $o_t$ based on the current state. After the agent takes an action $a_t\sim\pi_\theta(o_t)$ based on this observation, it receives a reward $R(s_t)$ from the environment, incurs a penalty cost $C(s_t)$, and transitions to a new state $s_{t+1}=T(s_t,a_t)$. Depending on the RL algorithm, CPO or TRPO, the agent optimizes the corresponding objective function.
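A hypothetical rollout sketch of this interaction loop follows, assuming a gym-style interface; env, policy, and constraint_cost are stand-ins for the MuJoCo relocation environment, our Gaussian MLP policy, and the penalty function of Eq. 3.

```python
def run_episode(env, policy, constraint_cost):
    """Roll out one episode, accumulating task reward and constraint penalty cost."""
    obs = env.reset()
    total_reward, total_cost, done = 0.0, 0.0, False
    while not done:
        action = policy.sample(obs)              # a_t ~ pi_theta(o_t)
        obs, reward, done, _ = env.step(action)  # reward R(s_t) from the environment
        total_cost += constraint_cost(obs)       # penalty cost C(s_t) for boundary violations
        total_reward += reward
    return total_reward, total_cost
```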

Observation space: The observation space is 39-dimensional including 24 hand joint angles, hand translation (3-D), hand orientation (3-D), relative position of the hand with respect to the object (3-D), relative position of the hand with respect to the goal (3-D) and relative position of the object with respect to the goal (3-D).
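For illustration, a hypothetical slicing of this 39-dimensional observation is shown below; the exact ordering used inside the environment may differ.

```python
import numpy as np

def split_observation(obs):
    """Split the 39-D observation into its named components (illustrative layout)."""
    obs = np.asarray(obs)
    assert obs.shape == (39,)
    return {
        "hand_joint_angles": obs[0:24],   # 24 hand joint angles
        "hand_translation":  obs[24:27],  # 3-D hand position
        "hand_orientation":  obs[27:30],  # 3-D hand orientation
        "hand_to_object":    obs[30:33],  # hand position relative to the object
        "hand_to_goal":      obs[33:36],  # hand position relative to the goal
        "object_to_goal":    obs[36:39],  # object position relative to the goal
    }
```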

Action space: Each action for the relocation task is 30-dimensional, including 24 hand joint angles, 3-D hand translation and 3-D hand rotation.

Reward: The agent is rewarded for moving closer to the object, lifting the object, moving the object closer to the goal, moving the hand closer to the goal, and bringing the object very close to the goal.

3.4 Constraint Formulation

We define boundary constraints to restrict the initial motion of the robot hand in the direction of the object for relocation. Considering $\textbf{x}_{h}$ as the initial hand position, $\textbf{x}_{b}$ as the initial object position, and $\textbf{x}$ as the current hand position, the constraints are defined in Eq. 3, where $r$ is the boundary radius. They are visualized in Fig. 2, and their derivation can be found in Appendix A.

$$\begin{aligned}
&\frac{|(\textbf{x}_{b}-\textbf{x}_{h})\times(\textbf{x}_{h}-\textbf{x})|}{|\textbf{x}_{b}-\textbf{x}_{h}|}\leq r\\
&0\leq\frac{(\textbf{x}-\textbf{x}_{h})\cdot(\textbf{x}_{b}-\textbf{x}_{h})}{|\textbf{x}_{b}-\textbf{x}_{h}|^{2}}\leq 1
\end{aligned}\tag{3}$$

We penalize the agent with a fixed penalty cost (set to 0.01) whenever it violates either of the formulated constraints; if it violates both, it receives twice the penalty cost. For practical purposes, we relax the second constraint to range between -0.1 and 1.1, since a tight second constraint does not allow the robot to learn to grasp the ball.
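A minimal sketch of this penalty computation is given below, using the distance and projection expressions derived in Appendix A; the function name is illustrative, and the default parameter values follow the medium-radius configuration in Appendix B.

```python
import numpy as np

def boundary_penalty(x, x_h, x_b, r=0.05, t_min=-0.1, t_max=1.1, penalty=0.01):
    """Penalty cost for the cylindrical boundary constraint of Eq. 3.
    x: current hand position, x_h: initial hand position, x_b: initial object position."""
    x, x_h, x_b = map(np.asarray, (x, x_h, x_b))
    axis = x_b - x_h
    # Perpendicular distance from the hand to the line through x_h and x_b (Eq. A.8).
    d = np.linalg.norm(np.cross(axis, x_h - x)) / np.linalg.norm(axis)
    # Projection parameter t along the segment from x_h to x_b (Eq. A.6).
    t = np.dot(x - x_h, axis) / np.dot(axis, axis)
    cost = 0.0
    if d > r:                      # outside the cylinder radius
        cost += penalty
    if not (t_min <= t <= t_max):  # outside the (relaxed) segment bounds
        cost += penalty
    return cost
```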

4 Experiments

Figure 4: Experiments with constraint parameters. Left: (Top) A smaller radius takes significantly more samples to train, whereas relaxed radii are more sample efficient. (Bottom) The average number of violations also decreases more quickly for a relaxed constraint than for a tighter one, although it is lower to begin with. Center: (Top) As the cost limit decreases, so does the sample efficiency. (Bottom) CPO optimizes for the respective limits and maintains the allowed cost throughout most of the training. Right: (Top) The higher the penalty cost for each violation, the longer the policy takes to train. (Bottom) Scaling the penalty costs has little impact on how the average number of violations decreases during training.
Figure 5: Average constraint violations over 500 roll outs after training. The CPO policy maintains fewer violations and is thus safer to roll out in the real world than the TRPO policies.

We design different experiments to evaluate the following research questions:

  1. Does the policy learned via CPO allow safe training and safe roll outs? How does it compare to a TRPO policy?

  2. What is the effect of changing the constraint boundary radius on training CPO?

  3. What is the effect of changing the allowed cost limit on training CPO?

  4. What is the effect of changing the penalty cost per violation on training CPO?

4.1 Setup

We use the MuJoCo physics simulator as our testing platform. All our RL policies are pre-trained using behavior cloning on 25 demonstrations collected with a CyberGlove III Rajeswaran et al. (2017). Our policy network is a Gaussian Multi-Layer Perceptron (MLP) with two hidden layers of 32 neurons each. We also train a value network and, for CPO only, a cost value network, both MLPs with two hidden layers of 128 neurons each. The learning rate for behavior cloning on the policy network, value network, and cost value network is 0.001. For training with the CPO and TRPO algorithms, the reward and cost discount factors are both 0.995 and the GAE parameter is 0.97. The constraint configurations for our different experiments are detailed in Appendix B.
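For concreteness, a minimal PyTorch sketch of the networks described above is shown below; our actual implementation (see the linked repository) may structure these modules differently.

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Gaussian MLP policy: two hidden layers of 32 units, state-independent std."""
    def __init__(self, obs_dim=39, act_dim=30, hidden=32):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

def make_value_net(obs_dim=39, hidden=128):
    """Value network (and, for CPO, cost value network): two hidden layers of 128 units."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, 1),
    )
```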

4.2 Experiment 1: Evaluating CPO and TRPO

We evaluate both CPO and TRPO algorithms on the relocation task to verify the effect of optimizing for constraints on sample efficiency and average cost during training. We evaluate two variants of the TRPO algorithm – one where the reward is penalized with the incurred cost (TRPO-RP), and one without the penalty (TRPO).

We see that CPO has a lower average number of violations than both the TRPO policies for three different intensities of the boundary constraint. From Fig. 3 (bottom), we see that CPO learns to satisfy the constraints better throughout the training process, making it potentially safer to train on real robots. Qualitative results showing the behavior for all three algorithms during early, mid and late training can be found in the video at this link. CPO also has lower sample efficiency for the three constraint cases, which is a small trade-off to ensure safer learning (Fig. 3 (top)).

We also evaluate the average number of violations for the CPO and TRPO policies after training. We find that the CPO policy continues to maintain fewer violations, and is thus safer to roll out for real-world applications than the TRPO policies, even though all the policies perform the task successfully $\geq 95\%$ of the time (Fig. 5).

4.3 Experiment 2: Effect of Boundary Radius

We evaluate the effect of changing the boundary radius in our constraint formulation on training a CPO policy. We find that a tighter radius takes significantly more samples to train, whereas a relaxed radius is more sample efficient (Fig. 4 (left top)). This is reinforced by the average number of violations decreasing more quickly for a relaxed constraint than for a tighter one, although it is lower to begin with (Fig. 4 (left bottom)).

4.4 Experiment 3: Effect of Cost Limits

We also evaluate how changing the overall cost limit affects the way CPO policies are trained. We see that as the cost limit decreases, the sample efficiency also decreases (Fig. 4 (center top)). From the average number of violations plot (Fig. 4 (center bottom)), we see that CPO optimizes perfectly for the respective limits and maintains the allowed cost throughout most of the training.

4.5 Experiment 4: Effect of Penalty Costs

We evaluate how changing the scale of the penalties impacts the way CPO trains. We linearly scale the cost limits in this case to maintain the same number of allowed constraint violations. We observe that the higher the penalty cost per violation, the longer the policy takes to train (Fig. 4 (right top)). However, scaling the penalty costs has little impact on how the average number of violations decreases during training (Fig. 4 (right bottom)).

5 Conclusion

We explore adding constraints to an object relocation task to potentially enable safe training on real robots. We formulate a constraint that restricts a robot hand’s motion to within a boundary when approaching the object, and learn a policy that uses CPO to optimize the constraint cost. We find that learning to follow the constraints via CPO reduces the average cost during training and roll outs, especially when compared to TRPO. This result is consistent across different constraint boundaries and throughout the training process. We also evaluate the effect of changing the boundary radius, cost limits, and penalty costs on training CPO, and find that tighter constraint boundaries and larger penalty costs reduce training efficiency. We conclude that the cylindrical boundary constraint we formulate for the relocation task helps the robot quickly learn safe motion during training and roll outs, and can thus be used to train dexterous manipulation tasks safely on real-world robots.

6 Future Work

To further investigate the robustness of our boundary constraint and approach, we plan to evaluate our methods on additional dexterous manipulation tasks, such as using a hammer and opening a door. We also plan to formulate a constraint that restricts the motion of the robot after the object has been grasped, to further ensure safety throughout the trajectory. Finally, we plan to implement the CPO algorithm on a real robot, such as the Shadow Hand, and evaluate the effectiveness of our algorithm for real-world applications.

References

  • Achiam et al. [2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International conference on machine learning, pages 22–31. PMLR, 2017.
  • Akkaya et al. [2019] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving Rubik’s Cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
  • Englert and Toussaint [2016] Peter Englert and Marc Toussaint. Combined optimization and reinforcement learning for manipulation skills. In Robotics: Science and systems, 2016.
  • Gu et al. [2017] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3389–3396. IEEE, 2017.
  • Huang et al. [2019] Sandy H Huang, Martina Zambelli, Jackie Kay, Murilo F Martins, Yuval Tassa, Patrick M Pilarski, and Raia Hadsell. Learning gentle object manipulation with curiosity-driven deep reinforcement learning. arXiv preprint arXiv:1903.08542, 2019.
  • Kobayashi et al. [2005] Yuichi Kobayashi, Hiroki Fujii, and Shigeyuki Hosoe. Reinforcement learning for manipulation using constraint between object and robot. In 2005 IEEE International Conference on Systems, Man and Cybernetics, volume 1, pages 871–876. IEEE, 2005.
  • Li et al. [2017] Zhijun Li, Ting Zhao, Fei Chen, Yingbai Hu, Chun-Yi Su, and Toshio Fukuda. Reinforcement learning of manipulation and grasping using dynamical movement primitives for a humanoidlike mobile manipulator. IEEE/ASME Transactions on Mechatronics, 23(1):121–131, 2017.
  • Rajeswaran et al. [2017] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.

Appendix A Deriving the constraints

Consider $\textbf{x}_{h}=(x_{h},y_{h},z_{h})$ to be the initial position of the hand and $\textbf{x}_{b}=(x_{b},y_{b},z_{b})$ to be the initial position of the object. Any point $\textbf{p}$ lying on the line passing through these two points can be written using a parameter $t$ as:

$$\textbf{p}=\begin{bmatrix}x_{h}+(x_{b}-x_{h})t\\ y_{h}+(y_{b}-y_{h})t\\ z_{h}+(z_{b}-z_{h})t\end{bmatrix}\tag{A.4}$$

The squared distance between any point on this line with parameter $t$ and a point $\textbf{x}=(x,y,z)$ is given by:

$$\begin{aligned}
d^{2}=\ &[(x_{h}-x)+(x_{b}-x_{h})t]^{2}\\
+\ &[(y_{h}-y)+(y_{b}-y_{h})t]^{2}\\
+\ &[(z_{h}-z)+(z_{b}-z_{h})t]^{2}
\end{aligned}\tag{A.5}$$

To minimize this distance, we set $\partial(d^{2})/\partial t=0$ and solve for $t$ to get:

$$t=\frac{(\textbf{x}-\textbf{x}_{h})\cdot(\textbf{x}_{b}-\textbf{x}_{h})}{|\textbf{x}_{b}-\textbf{x}_{h}|^{2}}\tag{A.6}$$

Plugging $t$ back into Eq. A.5 and simplifying, we get:

$$d^{2}=\frac{|\textbf{x}_{h}-\textbf{x}|^{2}\,|\textbf{x}_{b}-\textbf{x}_{h}|^{2}-[(\textbf{x}_{h}-\textbf{x})\cdot(\textbf{x}_{b}-\textbf{x}_{h})]^{2}}{|\textbf{x}_{b}-\textbf{x}_{h}|^{2}}\tag{A.7}$$

Using the vector quadruple product

$$(\textbf{A}\times\textbf{B})^{2}=\textbf{A}^{2}\,\textbf{B}^{2}-(\textbf{A}\cdot\textbf{B})^{2}$$

we obtain

$$d^{2}=\frac{|(\textbf{x}_{b}-\textbf{x}_{h})\times(\textbf{x}_{h}-\textbf{x})|^{2}}{|\textbf{x}_{b}-\textbf{x}_{h}|^{2}}$$

Applying the square root on both sides leads us to the following equation:

$$d=\frac{|(\textbf{x}_{b}-\textbf{x}_{h})\times(\textbf{x}_{h}-\textbf{x})|}{|\textbf{x}_{b}-\textbf{x}_{h}|}\tag{A.8}$$

To define a cylindrical boundary constraint as visualized in Fig. 2, we want the hand position $\textbf{x}$ to be within a fixed distance $r$ of the line joining $\textbf{x}_{h}$ and $\textbf{x}_{b}$. This implies that $d\leq r$, or

$$\frac{|(\textbf{x}_{b}-\textbf{x}_{h})\times(\textbf{x}_{h}-\textbf{x})|}{|\textbf{x}_{b}-\textbf{x}_{h}|}\leq r\tag{A.9}$$

Since we also do not want to consider the entire line joining $\textbf{x}_{h}$ and $\textbf{x}_{b}$, but only the line segment joining these two points, from Eq. A.4 we can say that $0\leq t\leq 1$, or

$$0\leq\frac{(\textbf{x}-\textbf{x}_{h})\cdot(\textbf{x}_{b}-\textbf{x}_{h})}{|\textbf{x}_{b}-\textbf{x}_{h}|^{2}}\leq 1\tag{A.10}$$

Thus, Eq. A.9 and Eq. A.10 form our set of constraints for the object relocation dexterous manipulation task.
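As a quick numerical sanity check of this derivation, the snippet below verifies that substituting the minimizer from Eq. A.6 into the distance reproduces the closed form of Eq. A.8; the three points are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
x_h, x_b, x = rng.normal(size=(3, 3))  # arbitrary hand, object, and query positions

axis = x_b - x_h
t_star = np.dot(x - x_h, axis) / np.dot(axis, axis)                        # Eq. A.6
d_direct = np.linalg.norm(x_h + t_star * axis - x)                         # distance at the minimizer
d_closed = np.linalg.norm(np.cross(axis, x_h - x)) / np.linalg.norm(axis)  # Eq. A.8

assert np.isclose(d_direct, d_closed)
```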

Appendix B Constraint configurations

We use five constraint parameters for our experiments – boundary radius ($r$), penalty cost ($c$), cost limit ($cl$), minimum $t$ value ($t_{min}$), and maximum $t$ value ($t_{max}$).

For all our experiments, $t_{min}$ is fixed at $-0.1$ and $t_{max}$ is fixed at $1.1$. We relax the original constraint range between $0$ and $1$ (Eq. A.10) for practical purposes.
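For reference, a hypothetical configuration dictionary collecting these five parameters for the medium-radius setting of Experiment 1 is shown below; the field names are illustrative.

```python
constraint_config = {
    "boundary_radius": 0.05,  # r
    "penalty_cost": 0.01,     # c, incurred per violated constraint per step
    "cost_limit": 0.25,       # cl, allowed expected discounted cost
    "t_min": -0.1,            # relaxed lower bound on the projection parameter t
    "t_max": 1.1,             # relaxed upper bound on the projection parameter t
}
```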

B.1 Experiment 1: Evaluating CPO and TRPO

For the first experiment, we use a fixed penalty cost ($c=0.01$) and cost limit ($cl=0.25$), and use three different constraint radii.

  • $r=0.1$ for large constraint radius

  • $r=0.05$ for medium constraint radius

  • $r=0.03$ for small constraint radius

B.2 Experiment 2: Effect of Boundary Radius

For the second experiment, we use a fixed penalty cost ($c=0.01$) and cost limit ($cl=0.25$), and use three different constraint radii.

  • $r=0.15$ for large constraint radius

  • $r=0.05$ for medium constraint radius

  • $r=0.03$ for small constraint radius

B.3 Experiment 3: Effect of Cost Limits

For the third experiment, we use a fixed penalty cost ($c=0.01$) and boundary radius ($r=0.05$), and use three different cost limits.

  • $cl=0.5$ for large cost limit

  • $cl=0.25$ for medium cost limit

  • $cl=0.1$ for small cost limit

B.4 Experiment 4: Effect of Penalty Costs

For the fourth experiment, we use a fixed boundary radius ($r=0.05$), and three different penalty costs. We scale the cost limits linearly with the penalty costs.

  • $c=10$, $cl=250$ for large penalty cost

  • $c=0.1$, $cl=2.5$ for medium penalty cost

  • $c=0.01$, $cl=0.25$ for small penalty cost