
Velocity-History-Based Soft Actor-Critic:
Tackling IROS’24 Competition “AI Olympics with RealAIGym”

Tim Lukas Faust1, Habib Maraqten1,2, Erfan Aghadavoodi1, Boris Belousov2 and Jan Peters1,2,3,4

1Intelligent Autonomous Systems Lab, Department of Computer Science, TU Darmstadt, Germany, [email protected]
2German Research Center for AI (DFKI)
3Centre for Cognitive Science, Technical University of Darmstadt
4Hessian Center for Artificial Intelligence (Hessian.AI), Darmstadt

We thank Hessisches Ministerium für Wissenschaft und Kunst for the grant “Einrichtung eines DFKI Labors an der TU Darmstadt”.
Abstract

The “AI Olympics with RealAIGym” competition challenges participants to stabilize chaotic underactuated dynamical systems with advanced control algorithms. In this paper, we present a novel solution submitted to the IROS’24 competition, which builds upon Soft Actor-Critic (SAC), a popular model-free entropy-regularized Reinforcement Learning (RL) algorithm. We add a ‘context’ vector to the state, which encodes the immediate history via a Convolutional Neural Network (CNN) to counteract unmodeled effects on the real system. Our method achieves high performance scores and competitive robustness scores on both tracks of the competition: Pendubot and Acrobot.

I Introduction

The AI Olympics competition [1], organized by the German Research Center for Artificial Intelligence (DFKI) and based on the RealAIGym project [2], focuses on advancing the athletic intelligence of robots. This competition provides standardized benchmarking tasks for underactuated double pendulum systems, specifically targeting the swing-up and stabilization of the pendulum from a hanging position to an upright stance. Performance is evaluated on a single run in simulation, and robustness is evaluated over numerous simulations with physical parameter variations, noise, and other perturbations. The real system’s performance is evaluated as an average over 10 runs. The competition includes two robotic configurations: the Acrobot, with an actuator at the elbow joint, and the Pendubot, with an actuator at the shoulder joint [3]. These configurations present significant challenges from a control perspective due to their underactuated and chaotic nature. The results, related work, and lessons from the first AI Olympics competition have been presented in [1]. The competing methods in 2023 included a model-based approach (Monte-Carlo Probabilistic Inference for Learning Control, MC-PILCO [4]), model-free RL with stabilization (SAC with LQR [5]), and value-based DQN [6].

A crucial challenge in the IROS’24 edition of the competition is the presence of randomized disturbances to the system, such as strong pushes against the pendulum at random times during the swing-up and stabilization. Since the disturbances act as additional physical effects, we propose adding history to the state to counteract these effects efficiently. The idea of encoding history in partially observed environments has been explored in [7], where a Long Short-Term Memory (LSTM) network was embedded into the Deterministic Policy Gradient (DPG) algorithm. While that work focused on memory-based control in partially observable settings, our approach focuses on finding an optimal policy across a variety of environments, regardless of the availability of direct state observations. To address the challenges of controlling underactuated systems in the second AI Olympics [1], we introduce a novel RL approach built on a version of the Soft Actor-Critic (SAC) [8] algorithm in which the actor and the critic networks use convolutional layers for temporal context feature extraction.

II Background

Soft Actor-Critic (SAC) [8] is a model-free, off-policy deep reinforcement learning algorithm enhanced with entropy regularization. It belongs to the actor-critic family, which combines value-based and policy-based methods. The actor, represented by a policy $\pi_{\phi}(a_{t}|s_{t})$ with parameters $\phi$, selects actions based on the current state. The critic, parameterized by $\theta$, evaluates the action-value (Q-) function. The critic takes the minimum over two networks to reduce the probability of overestimating the Q-value [9]. In the entropy framework, the actor aims to maximize the reward while also maximizing the expected entropy $\mathcal{H}$ of the policy during training. Since the policy should explore more in regions of unknown rewards, the temperature parameter $\alpha$ is automatically obtained by constrained optimization [10]. The SAC objective includes an entropy term and is given by

\[
J(\pi)=\sum_{t=0}^{T}\mathbb{E}_{(s_{t},a_{t})\sim\rho_{\pi}}\left[r(s_{t},a_{t})+\alpha\,\mathcal{H}(\pi(\cdot|s_{t}))\right].
\]

Here $\rho_{\pi}$ denotes the state-action marginal of the trajectory distribution induced by the policy $\pi$. In this paper, the SAC implementation from Stable-Baselines3 [11] is used.
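As a minimal, illustrative sketch (not the authors’ exact configuration), the Stable-Baselines3 SAC interface with automatic entropy-temperature tuning can be used as follows; the single-pendulum Gymnasium task and the listed hyperparameters are stand-ins for the competition setup.

```python
# Minimal sketch: SAC from Stable-Baselines3 with automatic entropy
# tuning (ent_coef="auto" corresponds to the constrained optimization
# of the temperature alpha). "Pendulum-v1" is only a stand-in for the
# competition's double-pendulum environment, and the hyperparameters
# are assumptions rather than the authors' settings.
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v1")
model = SAC(
    "MlpPolicy",
    env,
    ent_coef="auto",        # learn the temperature alpha during training
    gamma=0.99,             # assumed discount factor
    verbose=1,
)
model.learn(total_timesteps=100_000)  # assumed training budget
```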

III Method

Our method consists of four components: i) a specific model architecture for history encoding (Sec. III-A), ii) reward design (Sec. III-B), iii) system identification (Sec. III-C), and iv) training procedure (Sec. III-D).

III-A Model Architecture

Our model architecture for encoding the history and providing it to the actor and critic in SAC is shown in Fig. 1. It combines the current measurement (blue) with a learned contextual representation (red) capturing the current system dynamics. This additional context is necessary for a non-Markovian system, a partially observed system, or when the system dynamics are not known a priori and need to be inferred. For example, the torque needed to swing up a pendulum depends on the links’ masses, which are not directly observable from a single state measurement. Our approach therefore incorporates a temporal component that contextualizes the current state, implicitly encoding the system dynamics along with the measured position and velocity.

The embedding of the history into a context vector is achieved through a CNN, which processes a sequence of 12 past velocity measurements. This allows the model to infer the system dynamics and thereby be more robust to domain shift. After two 1-dimensional convolutional layers with a kernel size of 5 and 12 output channels, the resulting vector is passed through two linear layers of width 256, using a ReLU activation after each layer and a $\tanh$ activation after the final linear layer.
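A sketch of such an encoder is given below. The paper does not spell out every detail, so the following assumes that both joint velocities form 2 input channels over the 12-step history, that the convolutions use “same” padding, and that the context dimension equals the 256-wide final layer.

```python
# Sketch of the history-encoding CNN described above (assumptions: both
# joint velocities are the 2 input channels, "same" padding is used,
# and the context dimension equals the 256-wide final layer).
import torch
import torch.nn as nn

class VelocityContextEncoder(nn.Module):
    def __init__(self, history_len: int = 12, n_vel: int = 2, context_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_vel, 12, kernel_size=5, padding=2),  # 1-D conv, 12 output channels
            nn.ReLU(),
            nn.Conv1d(12, 12, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.mlp = nn.Sequential(
            nn.Linear(12 * history_len, 256),
            nn.ReLU(),
            nn.Linear(256, context_dim),
            nn.Tanh(),                                       # tanh after the final linear layer
        )

    def forward(self, vel_history: torch.Tensor) -> torch.Tensor:
        # vel_history: (batch, n_vel, history_len)
        return self.mlp(self.conv(vel_history))

# Example: encode a batch of 4 velocity histories into context vectors.
context = VelocityContextEncoder()(torch.randn(4, 2, 12))
print(context.shape)  # torch.Size([4, 256])
```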

The parameters of the model (e.g., history length, use of angular velocities as input) are chosen based on sensitivity analysis: evaluating the contribution of inputs by analyzing the gradients of the actor loss with respect to the input state (including the history). Past actions are not included in the context because adding them provided no improvement to the model performance.

The actor and the critic maintain their own copies of this model, i.e., there is no weight sharing; this provides additional flexibility and leads to better results. Finally, the context vector is concatenated with the state vector and passed to the actor and critic, whose two fully-connected layers have a width of 1024 (instead of the default 256 in Stable-Baselines3).
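In Stable-Baselines3 terms, this wiring might look as follows; the observation layout (four state values followed by the flattened 2×12 velocity history) and the context dimension are assumptions for illustration, while `share_features_extractor=False` and `net_arch` are standard policy options.

```python
# Hedged sketch: a custom features extractor that appends the encoded
# history to the current measurement, used without weight sharing
# between actor and critic and with 1024-wide policy/value layers.
import torch
import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class HistoryContextExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space: gym.spaces.Box, context_dim: int = 256):
        super().__init__(observation_space, features_dim=4 + context_dim)
        # VelocityContextEncoder is the CNN sketched in Sec. III-A above
        self.encoder = VelocityContextEncoder(context_dim=context_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        state, history = obs[:, :4], obs[:, 4:].reshape(-1, 2, 12)
        return torch.cat([state, self.encoder(history)], dim=1)  # measurement + context

policy_kwargs = dict(
    features_extractor_class=HistoryContextExtractor,
    share_features_extractor=False,   # separate copies for actor and critic
    net_arch=[1024, 1024],            # instead of the default 256-wide layers
)
# model = SAC("MlpPolicy", env, policy_kwargs=policy_kwargs)
```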

As an alternative to CNN, we experimented with using an LSTM network. However, while LSTM removes the need for padding, inference time increases dramatically and training is more sensitive to hyperparameters and reward changes.

Figure 1: Our model architecture for encoding the history into a context representation. A sequence of past velocity measurements is passed through convolutional and fully-connected layers, and the output is attached to the current measurement before being passed to the actor and critic in SAC.

III-B Reward Design

In the competition, the policies are evaluated with a Performance Score (PS) and a Robustness Score (RS). However, both of these measures only provide a sparse signal, being computed once at the end of a trajectory. We therefore found it important to design a dense reward that provides a more informative signal to the agent, which improves both the learning speed and the final policy performance.

We formulate the reward in terms of normalized state and action variables. The state $s=(q_{1},q_{2},\dot{q}_{1},\dot{q}_{2})$ consists of the joint positions and velocities of the shoulder and elbow joints, respectively. The angles and velocities are limited to the ranges $\pm 4\pi$ and $\pm 30\ \mathrm{rad/s}$, and we normalize both to $\pm 1$. Similarly, the 1D action $a$, with range $\pm 6$, is normalized to $\pm 1$. The goal state is $\boldsymbol{s}_{\mathrm{goal}}=(\pi,0,0,0)$.
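A small helper illustrating this normalization, using only the ranges stated above, could look as follows.

```python
# Normalization sketch based on the ranges given in the text:
# angles ±4π -> ±1, velocities ±30 rad/s -> ±1, torque ±6 Nm -> ±1.
import numpy as np

STATE_SCALE = np.array([4 * np.pi, 4 * np.pi, 30.0, 30.0])
ACTION_SCALE = 6.0

def normalize_state(s: np.ndarray) -> np.ndarray:
    """Map (q1, q2, dq1, dq2) into [-1, 1]^4."""
    return np.clip(s / STATE_SCALE, -1.0, 1.0)

def normalize_action(a: float) -> float:
    """Map the scalar torque into [-1, 1]."""
    return float(np.clip(a / ACTION_SCALE, -1.0, 1.0))

# The goal state (pi, 0, 0, 0) maps to (0.25, 0, 0, 0) in normalized coordinates.
s_goal_norm = normalize_state(np.array([np.pi, 0.0, 0.0, 0.0]))
```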

III-B1 Reward 1: Efficiency Through Reward

In this section, we introduce an alternative reward function, which we do not use in the final implementation because it achieved lower scores than the reward of Section III-B2. This reward function is given in (1).

\[
\begin{aligned}
R_{1}(s,a) &= r_{1}(d)+r_{2}(\dot{q}_{1},\dot{q}_{2},u)\cdot f_{1}(d,v_{2}), \quad\text{where} \\
r_{1}(d) &= \frac{1}{2d+1}-\frac{1}{5}, \\
f_{1}(d,v_{2}) &= \begin{cases}1 & \text{if } d\leq 0.5,\\ 1-\dfrac{1}{1+\exp(-10(v_{2}+0.2))} & \text{if } d>0.5,\end{cases} \\
r_{2}(\dot{q}_{1},\dot{q}_{2},u) &= \frac{1}{4\left(\frac{\dot{q}_{1}^{2}+\dot{q}_{2}^{2}}{400}+\frac{u^{2}}{20}\right)+1}.
\end{aligned}
\tag{1}
\]

It combines a Cartesian distance-based term $r_{1}$, which encourages a small distance $d$ to the goal, with a positive bonus $r_{2}\cdot f_{1}$ that encourages small angular velocities $\dot{q}_{1}$, $\dot{q}_{2}$ and a small applied torque $u$. The distance $d$ is normalized to the range $[0, 2]$, with the normalized pendulum length equal to 1. The parameter $v_{2}$ is the Cartesian end-effector velocity.
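A direct transcription of (1), assuming $d$, $v_{2}$, and $u$ are computed elsewhere as described above, could look like this.

```python
# Sketch of reward (1); d is the normalized Cartesian distance to the
# goal in [0, 2], v2 the Cartesian end-effector velocity, u the applied
# torque, and dq1, dq2 the joint velocities.
import numpy as np

def reward_1(d: float, dq1: float, dq2: float, u: float, v2: float) -> float:
    r1 = 1.0 / (2.0 * d + 1.0) - 1.0 / 5.0                              # distance-based term
    r2 = 1.0 / (4.0 * ((dq1**2 + dq2**2) / 400.0 + u**2 / 20.0) + 1.0)  # velocity/torque bonus
    if d <= 0.5:
        f1 = 1.0                                                        # near the goal: full bonus
    else:
        f1 = 1.0 - 1.0 / (1.0 + np.exp(-10.0 * (v2 + 0.2)))            # gate the bonus far from the goal
    return r1 + r2 * f1
```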

Even though reward function (1) did provide a decent performance score, robustness was very difficult to achieve. This led to the idea of implicitly learning temporal information about the state dynamics, described in Sec. III-A. Since the factor $f_{1}$ is only a patch for the more general problem of high rewards in the starting position due to low torque and velocities, this reward results in a poorly posed optimization problem with a narrow solution space. While it produced a surprisingly efficient swing-up in one go, it did not find a policy that first swings back to build up momentum, which benefits efficient torque usage.

III-B2 Reward 2: Efficiency Through Punishment

While (1) rewards good behavior with positive terms, we found that penalizing undesired behavior with negative terms instead leads to better performance and higher robustness. Namely, we propose the reward

\[
R_{2}(s,a)=-0.05\cdot\left((q_{1}-\pi)^{2}+q_{2}^{2}\right)-\mathcal{E}(s,a),
\tag{2}
\]

comprised of two terms: a squared angular distance to the goal and a regularization term $\mathcal{E}$ that captures the ‘effort’ of the agent to swing up the pendulum,

\[
\mathcal{E}=\Big(0.02\cdot\left(\dot{q}_{1}^{2}+\dot{q}_{2}^{2}\right)+0.25\cdot\left(a^{2}+2\cdot|a|\right)+0.02\cdot|a-a_{\mathrm{prev}}|/\mathrm{d}t+0.05\cdot\left|\dot{q}_{i}\cdot a\right|\Big)\cdot\beta.
\tag{3}
\]

Here $\beta=0.1$ and $\dot{q}_{i}=\dot{q}_{1}$ for the Pendubot, and $\beta=0.025$ and $\dot{q}_{i}=\dot{q}_{2}$ for the Acrobot. The weightings in (3) can be tuned to emulate the Performance Score. In contrast to (1), reward (2) is zero at the optimum, which makes learning the $Q$-function dramatically more stable, since an eventually successful swing-up policy does not suddenly change the $Q$-values of all previous states. This is a very apparent problem with the positive reward (1) and leads to frequent policy degradation after learning to swing up for the first time. The alternative reward presented in Section III-B1 results in a different and less efficient policy and is therefore not included in the final implementation.
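For illustration, (2) and (3) can be transcribed directly; whether the quantities enter in raw or normalized form is not fully specified above, so the sketch uses them exactly as written in the equations, with dt = 0.02 s matching the 50 Hz control rate of Sec. III-D2.

```python
# Sketch of reward (2) with the effort term (3); the robot selects beta
# and which joint velocity enters the |dq_i * a| term.
import numpy as np

def reward_2(q1, q2, dq1, dq2, a, a_prev, dt=0.02, robot="pendubot"):
    beta, dq_i = (0.1, dq1) if robot == "pendubot" else (0.025, dq2)
    distance = 0.05 * ((q1 - np.pi) ** 2 + q2 ** 2)       # squared angular distance to the goal
    effort = beta * (
        0.02 * (dq1 ** 2 + dq2 ** 2)                      # joint-velocity penalty
        + 0.25 * (a ** 2 + 2.0 * abs(a))                  # torque-magnitude penalty
        + 0.02 * abs(a - a_prev) / dt                     # torque-rate (smoothness) penalty
        + 0.05 * abs(dq_i * a)                            # mechanical-power-like penalty
    )
    return -distance - effort
```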

III-C System Identification

The downside of using a simulated environment for training an agent to perform tasks on a real system is the so-called Sim-to-Real gap [12]. This gap was quite noticeable, as a trained policy could not initially achieve the swing-up and stabilization task on the real system despite succeeding in the simulated environment. Since the success of training in a simulated environment is limited by the accuracy of the simulated model, our approach optimizes the physical parameters $\theta_{m}$ so that, for the same torque input, the simulated trajectories are closer to those of the real system.

To minimize the Sim-to-Real gap, we performed a system identification by solving an optimization problem with a differential evolution algorithm. We collected trajectories for the Pendubot and Acrobot on the real system using a policy trained in simulation. Each measured time series of applied torque from a selected subset of the recordings is then replayed on the simulated system to create a corresponding simulated trajectory. As the cost function for the optimization, we take the squared error between each simulated and real trajectory over all torque time series,

\[
J(\theta_{m})=\sum_{i=1}^{N_{\text{traj}}}\sum_{t=1}^{T_{i}}\sum_{j=1}^{m}w(t,T_{i})\left(x_{\text{sim},j}^{(i)}[t;\theta_{m}]-x_{\text{real},j}^{(i)}[t]\right)^{2},
\tag{4}
\]

where the terms are importance-weighted with

\[
w(t,T_{i})=1-\frac{0.5\,(t-1)}{T_{i}-1}.
\tag{5}
\]

Here $j$ is the state index; the innermost sum runs over the $m=4$ state dimensions at the current time step of the current trajectory. The middle sum runs over the $T_{i}$ time steps of trajectory $i$, with $t$ denoting the current time step, and the outer sum runs over the $N_{\text{traj}}$ trajectories, indexed by $i$. Sobol sampling [13] is used for initialization. Hyperparameters of the Differential Evolution [14] algorithm are provided in Table III in the appendix.

Because of the double pendulum’s sensitive behavior, the cost function (4) becomes very difficult to optimize a few seconds into a recording. Therefore, we only optimize over the first 1.5 seconds of each trajectory, which is sufficient for a good identification. Additionally, we adjust the error weighting (5) so that errors accumulating over time have less impact, emphasizing a clean initial trajectory: the weighting decreases linearly from 1 at the start to 0.5 at the end of the trimmed trajectory. Since the optimization converges to different local optima when repeated, we include multiple sets of optimized parameters in the multi-environment training strategy when training a new agent, to avoid overfitting to a single parameter set.
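Putting (4) and (5) together, the identification can be sketched with SciPy’s differential evolution using the hyperparameters of Table III; `simulate_trajectory`, `recordings`, and the parameter bounds are hypothetical placeholders for the simulator rollout, the recorded data, and the physical-parameter ranges.

```python
# Sketch of the system-identification setup: a weighted squared-error
# cost over the first 1.5 s of each recorded trajectory, minimized with
# SciPy's differential evolution (hyperparameters as in Table III).
import numpy as np
from scipy.optimize import differential_evolution

DT, T_FIT = 0.002, 1.5                       # 500 Hz recordings, fit only the first 1.5 s
N_FIT = int(T_FIT / DT)

def weights(T_i: int) -> np.ndarray:
    t = np.arange(1, T_i + 1)
    return 1.0 - 0.5 * (t - 1) / (T_i - 1)   # Eq. (5): linearly from 1 down to 0.5

def cost(theta: np.ndarray, recordings) -> float:
    J = 0.0
    for torques, x_real, x0 in recordings:
        torques, x_real = torques[:N_FIT], x_real[:N_FIT]
        x_sim = simulate_trajectory(theta, torques, x0)          # hypothetical simulator rollout
        w = weights(len(x_real))
        J += float(np.sum(w[:, None] * (x_sim - x_real) ** 2))   # Eq. (4)
    return J

# bounds on the physical parameters (masses, lengths, friction, ...) are problem-specific;
# init="sobol" reproduces the Sobol-sampled initial population mentioned above.
# result = differential_evolution(
#     cost, bounds, args=(recordings,), strategy="best1bin", maxiter=1_000_000,
#     popsize=2048, tol=1e-10, mutation=(0.5, 1.0), recombination=0.9,
#     polish=False, updating="deferred", workers=-1, disp=True, init="sobol",
# )
```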

III-D Training Procedure

III-D1 Multi-Environment Training

To achieve robust performance over a range of disturbances, we employ multi-environment training, where each environment focuses on one type of disturbance. The environment is reset after each episode with a different disturbance value to prevent overfitting to any single environment. Evaluation is also conducted on a large set of environments with uniformly distributed, constant disturbance values.

We exclude disturbances of the type “torque perturbations” from training because they make learning too challenging: since torque perturbations directly affect the actions, their $Q$-values become hard to learn in a system as sensitive as an underactuated double pendulum.

Furthermore, we exclude “action noise” and “action responsiveness” (which simulates an integrator effect on the actions), because both of these scores are maxed out even without using history, so there is no need to incorporate them into the training process.
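As an illustration of this per-episode re-sampling, one might wrap each training environment as follows; the disturbance names, ranges, and the `set_disturbance` setter are hypothetical placeholders rather than the competition API.

```python
# Illustrative wrapper: every reset draws a new value for one type of
# disturbance so that no single configuration is overfit. The
# disturbance names, ranges, and set_disturbance() are hypothetical.
import gymnasium as gym
import numpy as np

class DisturbanceReset(gym.Wrapper):
    def __init__(self, env, disturbance: str, low: float, high: float):
        super().__init__(env)
        self.disturbance, self.low, self.high = disturbance, low, high

    def reset(self, **kwargs):
        value = np.random.uniform(self.low, self.high)
        self.env.unwrapped.set_disturbance(self.disturbance, value)  # hypothetical setter
        return self.env.reset(**kwargs)

# e.g., one environment per disturbance type, combined in a vectorized env:
# envs = [DisturbanceReset(make_pendubot(), "mass_scale", 0.8, 1.2),
#         DisturbanceReset(make_pendubot(), "time_delay", 0.0, 0.05)]
```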

Figure 2: Successful swing-up on a simulated Pendubot system.

III-D2 Data Collection

The controller benchmarks are simulated at a frequency of 500 Hz for a duration of 10 seconds. Using such a high frequency for RL training would result in lengthy episodes, requiring a high discount factor and significantly increasing training time. Therefore, training is performed at a control frequency of only 50 Hz with an episode length of 5 seconds, which is still sufficient for learning to swing up and stabilize the pendulum and results in twice as many valuable swing-up experiences in the buffer.

A beneficial side effect of the difference between control and simulation update rates is that, during benchmarking, the simulation provides 10 measurements for each applied action. These additional measurements can be utilized to filter out velocity measurement noise during the robustness evaluation: the controller averages all available velocity measurements between the 50 Hz sampling points of the state history to estimate the velocity at the end of that window. This way, any controller update rate is possible. To accommodate the resulting time shift, a measurement delay of half the timestep length is introduced for the velocity during training. Also, the maximum standard deviation of velocity noise in the robustness evaluation is reduced by a factor of $\sqrt{10}$ in the corresponding training environment, simulating the theoretical noise reduction later achieved through averaging. Integrating this filtering method into the controller enables us to achieve a velocity noise robustness of 100% (cf. Table II).
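A sketch of this averaging filter is given below; the buffer size and two-joint velocity layout follow the 500 Hz / 50 Hz setting described above, while the class interface itself is only illustrative.

```python
# Illustrative velocity filter: collect the 500 Hz velocity samples seen
# since the last 50 Hz control decision and use their mean as the
# velocity estimate for that window (attenuating i.i.d. noise by ~sqrt(10)).
import numpy as np
from collections import deque

class VelocityAverager:
    def __init__(self, samples_per_step: int = 10, n_vel: int = 2):
        self.buffer = deque(maxlen=samples_per_step)
        self.n_vel = n_vel

    def add_measurement(self, velocity: np.ndarray) -> None:
        self.buffer.append(np.asarray(velocity, dtype=float))   # called at 500 Hz

    def estimate(self) -> np.ndarray:
        if not self.buffer:
            return np.zeros(self.n_vel)
        return np.mean(self.buffer, axis=0)                     # read at 50 Hz

    def clear(self) -> None:
        self.buffer.clear()                                     # start the next control window
```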

IV Results and Discussion

Figure 3: Robustness Metrics for Pendubot Controller.

In this section, the results for the Pendubot are shown, achieved with reward function (2). Training takes under 3 hours on an NVIDIA GeForce RTX 4070 GPU. The Performance Score metrics for our controller (History SAC) and three benchmark controllers (iLQR with Riccati gains, TVLQR, and iLQR MPC) are shown in Table I. Our controller shows a relatively quick swing-up time of 1.15 seconds while consuming less energy than the baselines (see Fig. 2).

TABLE I: Performance Score Metrics for Pendubot
Metric | History SAC | iLQR Riccati | TVLQR | iLQR MPC
Swing-up Time [s] | 1.18 | 4.13 | 4.13 | 4.12
Energy [J] | 7.76 | 9.54 | 9.54 | 9.92
Torque Cost [N²m²] | 2.37 | 1.25 | 1.26 | 1.77
Torque Smoothness [Nm] | 0.014 | 0.005 | 0.007 | 0.083
Velocity Cost [m²/s²] | 68.42 | 211.34 | 211.12 | 211.98
RealAI Score | 0.634 | 0.536 | 0.526 | 0.352
Username | tfaust | fwiebe | fwiebe | fwiebe

The Robustness Score metrics are shown in Table II and Figure 3. Our model achieves a robustness score of 0.905, significantly higher than the baselines.

TABLE II: Robustness Score Metrics for Pendubot
Metric [%] | History SAC | iLQR Riccati | TVLQR | iLQR MPC
Model | 92 | 5.2 | 62.9 | 31.9
Velocity Noise | 100 | 42.9 | 90.5 | 81
Torque Noise | 100 | 9.5 | 100 | 100
Torque Step Response | 100 | 85.7 | 100 | 100
Time Delay | 81 | 14.3 | 42.9 | 38.1
Perturbations | 70 | 0 | 60 | 48
Robustness Score | 0.905 | 0.263 | 0.76 | 0.665
Username | tfaust | fwiebe | fwiebe | fwiebe

During training, our primary focus was on achieving optimal performance on the Pendubot. The incorporation of state history significantly enhanced the model’s robustness. The main challenge we encountered was high friction in the second joint, which requires a much higher torque early in the episode, when there is no history to analyze yet. Nevertheless, although perturbation robustness was not explicitly incorporated into training, our model still outperformed the benchmark controllers. Our velocity noise filtering method (Sec. III-D) achieved 100% robustness. Because the robustness evaluation takes a long time, only the most promising models were tested.

We observed a delay robustness of 100% many times, but not in the best overall results shown in Fig. 3. We still believe 100% delay robustness could have been achieved without compromising the other metrics, given more time to fine-tune. Even with the current 81% delay robustness, our approach clearly outperforms all benchmark controllers, as shown in Table II. The Performance Score of the model in Table I is mainly attributable to the reward function; the inclusion of the history has no effect on it. By carefully tuning the reward weights, we were able to outperform all benchmark controllers. Optimizing only on the default environment resulted in a score of 0.683.

The Acrobot results are similar, though slightly worse, as stabilizing the setup is more challenging. Due to time constraints of the challenge, less focus was placed on achieving the highest possible score. As a result, we chose to omit a discussion of the Acrobot scores, as it would offer no additional insights beyond those already presented. A direct comparison with a default SAC without history is not feasible, as in the multi-environment training setup, a model lacking contextual information cannot learn how to swing up and stabilize at all. This indicates the importance of incorporating a state history, allowing the model to infer its current environment and make informed decisions.

V Conclusion

We presented a history-based version of SAC that infers system dynamics from temporal data (Sec. III-A), allowing the model to find a good policy across many different environments by utilizing the inferred context.

We implemented two reward functions (Sec. III-B). The reward in (1) did not achieve a globally optimal swing-up behavior due to its limiting nature, which led to reward (2), with which the results in this paper are produced. The main practical difference between the two rewards is the negative sign of (2), leading to a reward of 0 in the goal state, which greatly improves stability during training.

The achieved Performance Score for the Pendubot is 0.634 and the Robustness Score is 0.905, both outperforming the benchmark controllers. A high Performance Score can easily be obtained by fine-tuning the weights of the different effort penalties and overfitting to the default environment in which the score is evaluated. Thus, there is a trade-off when optimizing both scores simultaneously.

The main focus of this paper is achieving good performance in various environments, which is made possible by context extraction. Still, specific physical parameters proved too difficult to compensate for, since the necessary policy differs too much from the policies for the other environments. Future work should explore a more effective method of preparing the agent for torque perturbations without negatively impacting training. One promising approach is initializing episodes in a representative state during such a perturbation. This way, the $Q$-values of earlier states are unaffected, since those states do not exist in this setup. However, this method could hinder learning the context, because the missing history would be filled with padding unless it is properly initialized as well.

VI Appendix

Table III shows hyperparameters for the Differential Evolution algorithm, which optimizes the physical parameters during system identification (Sec. III-C).

TABLE III: Differential Evolution Hyperparameters
Parameter | Initial Value | Lower Bound | Upper Bound | Description
strategy | best1bin | - | - | Strategy for selection and crossover
maxiter | 1,000,000 | - | - | Maximum iterations
popsize | 2048 | - | - | Size of the population
tol | $1\times 10^{-10}$ | - | - | Tolerance for stopping
mutation | (0.5, 1.0) | 0.5 | 1.0 | Mutation factor range
recombination | 0.9 | 0.0 | 1.0 | Crossover probability
polish | False | - | - | Perform local search after optimization
updating | deferred | - | - | Update mode for population
workers | -1 | - | - | Number of parallel workers
disp | True | - | - | Display convergence messages
init | population | - | - | Initial population

References

  • [1] F. Wiebe, N. Turcato, A. Dalla Libera, C. Zhang, T. Vincent, S. Vyas, G. Giacomuzzo, R. Carli, D. Romeres, A. Sathuluri, et al., “Reinforcement learning for athletic intelligence: Lessons from the 1st “AI Olympics with RealAIGym” competition,” in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), K. Larson, Ed., International Joint Conferences on Artificial Intelligence Organization, vol. 8, pp. 8833–8837, 2024.
  • [2] F. Wiebe, S. Vyas, L. Maywald, S. Kumar, and F. Kirchner, “Realaigym: Education and research platform for studying athletic intelligence,” in Proceedings of Robotics Science and Systems Workshop Mind the Gap: Opportunities and Challenges in the Transition Between Research and Industry, New York, 2022.
  • [3] F. Wiebe, S. Kumar, L. J. Shala, S. Vyas, M. Javadi, and F. Kirchner, “Open source dual-purpose acrobot and pendubot platform: Benchmarking control algorithms for underactuated robotics,” IEEE Robotics & Automation Magazine, 2023.
  • [4] A. D. Libera, N. Turcato, G. Giacomuzzo, R. Carli, and D. Romeres, “Athletic intelligence olympics challenge with model-based reinforcement learning,” IJCAI, 2023.
  • [5] C. Zhang and A. Sathuluri, “Solving the swing-up and balance task for the acrobot and pendubot with sac,” IJCAI, 2023.
  • [6] T. Vincent and B. Belousov, “Deep reinforcement learning for pendubot,” IJCAI, 2023.
  • [7] N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver, “Memory-based control with recurrent neural networks,” arXiv preprint arXiv:1512.04455, 2015.
  • [8] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning (ICML), PMLR, pp. 1861–1870, 2018.
  • [9] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
  • [10] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine, “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2019.
  • [11] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
  • [12] F. Muratore, M. Gienger, and J. Peters, “Assessing transferability from simulation to reality for reinforcement learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 4, pp. 1172–1183, 2019.
  • [13] I. Sobol’, “On the distribution of points in a cube and the approximate evaluation of integrals,” USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 4, pp. 86–112, 1967.
  • [14] R. Storn and K. Price, “Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces,” Journal of Global Optimization, vol. 11, no. 4, pp. 341–359, 1997.