

Preprint, submitted to ICCAS.


Epersist: A Self-Balancing Robot Using a PID Controller and Deep Reinforcement Learning

Ghanta Sai Krishna (Student Member, IEEE), Garika Akshay, Dyavat Sumith
Department of Data Science and Artificial Intelligence, IIIT Naya Raipur, Chhattisgarh, 492101, India
Corresponding author: Ghanta Sai Krishna (e-mail: [email protected])
Abstract

A two-wheeled self-balancing robot is an instance of the inverted pendulum and is an inherently non-linear, unstable system. The fundamental concept of the proposed framework "Epersist" is to overcome the challenge of counterbalancing such an initially unstable system by delivering two robust control mechanisms: a Proportional-Integral-Derivative (PID) controller and Reinforcement Learning (RL). Moreover, the NodeMCU ESP32 micro-controller and the inertial sensor in Epersist require few computational resources to deliver accurate instructions about the wheel rotation to the motor driver, which drives the wheels and keeps the robot balanced. The framework also comprises the mathematical model of the PID controller and a novel self-trained Advantage Actor-Critic algorithm as the RL agent. After several experiments, the calibrated values of the control variables are taken as the benchmarks for attaining the angle of static equilibrium. The "Epersist" framework provides both PID- and RL-assisted functional prototypes and simulations for better utility.

Index Terms

Two-Wheeled Self-Balancing Robot, PID, Reinforcement Learning, NodeMCU ESP32

1 Introduction

The two-wheeled self-balancing robot (TWSBR) is a classic robotic platform with applications in various fields, including transportation and exploration. Over the last few decades, academia and industry have paid close attention to both the design and the control of the TWSBR. The TWSBR is a high-order, multi-variable, nonlinear, tightly coupled, and inherently unstable system. Conventional methods such as PID [1], fuzzy control [2], and sliding mode control [3] have been proposed in recent years. The balancing controller of a two-wheeled mobile robot has been treated as either a linearized or a nonlinear model [4]. The physical characteristics of the robot are crucial to achieving optimal control [5]. In practical applications, it is frequently desirable to obtain optimality beyond simple stabilisation. Although existing works achieve good stabilization, they are less capable of achieving optimality. A control system is not optimal if its stabilization criteria depend on the physical characteristics of the robot, so advanced techniques are needed to achieve optimality.

In recent works, Reinforcement Learning (RL) has been adopted as a control mechanism for the TWSBR to attain optimality along with stability [6]. RL enables robots to learn, adapt, and optimize their behaviour by interacting with their surroundings; it solves optimization problems in which an agent interacts with its environment and updates its behaviour or control policy in response to feedback or rewards. RL techniques such as Q-learning [7], Proximal Policy Optimization (PPO) [8], and Soft Actor-Critic (SAC) [9] have been widely used in the TWSBR to attain better stability. In contrast to these existing solutions, the proposed methodology highlights the benefits of a self-trained model based on the Advantage Actor-Critic (A2C) algorithm [10] for the self-balancing robot.

Irrespective of the robot's control mechanism, it is essential to understand theoretical aspects such as the transfer function [11] and stability [12], which are crucial for judging the stability of the system at a given time instant. Simulations of the robot are required to study these theoretical aspects; however, the circumstances and constraints of hardware prototyping differ slightly from those of simulation. Moreover, hardware experimental data is valuable, and RL architectures usually require very large volumes of data [13]. Apart from functionality, it is also essential to make the robot cost-effective. Existing solutions utilize micro-controllers such as the Raspberry Pi [14] and the Arduino UNO [15], which make the overall cost of the robot high. Thus, there is a need to improve the cost efficiency and practical utility of the system. To overcome these disadvantages of the existing solutions, the proposed methodology effectively analyses both the theoretical aspects via simulations and the hardware prototyping conditions of the robot.

Moreover, the proposed solution utilizes the NodeMCU ESP32 as the micro-controller to make the system cost-effective. To improve the practical utility of the robot, the proposed solution also includes a mobile interface that connects to the micro-controller.

The significant contributions and improvements of Epersist over existing works are as follows:

• Analyzing and deriving the theoretical aspects and effects of the PID control mechanism on the robot.

• Building a novel, robust and computationally lightweight A2C-based deep reinforcement learning agent.

• Deploying the PID mechanism and the RL model on the NodeMCU to build a cost-effective hardware prototype.

• Deploying a Bluetooth-based mobile application to control the robot for greater utility.


Figure 1: The conceptual overview of the proposed framework "Epersist"

2 Methodology of Epersist

This section discusses the procedures involved in the proposed methodology, which is divided into two phases: (i) physical modelling and the PID control mechanism, and (ii) the deep RL agent. The overview of the proposed methodology is represented in Fig. 1.

2.1 Physical Modelling and PID Control Mechanism

The physical model of the robot is examined using the fundamental concept of the inverted pendulum, which involves deriving the transfer functions of the robot's control variables (pitch and yaw), as shown in Fig. 2. The pitch is the angle of deviation of the robot about the Y-axis, whereas the yaw is the position of the robot along the X-axis. The entire robot is divided into a cart and a pendulum based on the mass distribution ($m_{1}$, $m_{2}$) and position, respectively. The initial conditions of position and velocity are taken as zero ($x(0)=0$).


Figure 2: Free-Body Diagram of the Robot
F_{input} = (m_{1}+m_{2})\ddot{x} + f\dot{x} + m_{2}l\ddot{\phi}\cos\phi - m_{2}l\dot{\phi}^{2}\sin\phi   (1)
(I_{2}+m_{2}l^{2})\ddot{\phi} + m_{2}gl\sin\phi = -m_{2}l\ddot{x}\cos\phi   (2)

where $l$ is the distance to the pendulum's centre of mass, $\phi$ is the angle between the pendulum and the Y-axis, and $f$ is the friction coefficient. Eq. (1) and Eq. (2) are derived from the fundamental laws of motion. The resulting transfer functions of pitch and yaw for the robot are shown in Eq. (3) and Eq. (4), respectively.

G_{pitch}(s) = \frac{\frac{m_{2}l}{q}\,s}{s^{3} + \frac{f(I_{2}+m_{2}l^{2})}{q}\,s^{2} - \frac{(m_{2}+m_{1})m_{1}gl}{q}\,s - \frac{fm_{2}gl}{q}}   (3)
G_{yaw}(s) = \frac{\frac{(I_{2}+m_{2}l^{2})s^{2} - gm_{2}l}{q}}{s^{4} + \frac{f(I_{2}+m_{2}l^{2})}{q}\,s^{3} - \frac{(m_{2}+m_{1})m_{1}gl}{q}\,s^{2} - \frac{fm_{2}gl}{q}\,s}   (4)
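As a quick illustration of how these transfer functions expose the open-loop instability, the short Python sketch below builds the denominator polynomial of Eq. (3) and inspects its roots. The mass values follow Table 1 (in SI units); the lever arm $l$, the pendulum inertia $I_{2}$, the friction coefficient $f$, and the abbreviation $q$ (taken here as the standard cart-pendulum expression, which the paper does not define) are assumptions made only for this sketch.

# Hedged sketch: open-loop pole check for G_pitch(s) in Eq. (3).
# Masses follow Table 1; l, I2, f and q are assumptions for illustration.
import numpy as np

m1 = 0.135            # mass of main body (kg), Table 1
m2 = 0.060            # mass of pendulum (kg), Table 1
l  = 0.05             # assumed distance to pendulum centre of mass (m)
I2 = m2 * l**2 / 3.0  # assumed rod-like pendulum inertia (kg*m^2)
f  = 0.1              # assumed viscous friction coefficient
g  = 9.81

# Assumed standard cart-pendulum abbreviation (not defined in the text)
q = (m1 + m2) * (I2 + m2 * l**2) - (m2 * l)**2

# Denominator of G_pitch(s), highest power first
den = [1.0,
       f * (I2 + m2 * l**2) / q,
       -(m2 + m1) * m1 * g * l / q,
       -f * m2 * g * l / q]

poles = np.roots(den)
print("open-loop poles:", poles)
# A pole in the right half-plane confirms the inherent instability
print("unstable:", bool(np.any(poles.real > 0)))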

The transfer functions help in understanding the fundamental aspects of the system (e.g. stability), which are explained in the following subsections. On the other hand, the 6-axis Inertial Measurement Unit (IMU) sensor provides the 3-axis gyroscopic ($g_{x}, g_{y}, g_{z}$) and 3-axis accelerometer ($a_{x}, a_{y}, a_{z}$) measures. Computationally, the pitch and yaw calculation is based on these sensor measures, which differs from the conventional mathematical model. The instantaneous pitch angle $\phi$ depends on the previous pitch estimate ($\hat{\phi}(t=0)=0$) and is calculated with a complementary filter, as derived in Eq. (5).

\phi = \alpha\,(\hat{\phi} + g_{x}) + (1-\alpha)\,\operatorname{atan2}(a_{y}, a_{z})   (5)

where $\alpha$ is the filter coefficient and $\operatorname{atan2}(a_{y}, a_{z})$ represents the angle of $(a_{y}, a_{z})$ in the plane. The error $e(t)$ at an instant $t$ is the difference between the current pitch angle and the target pitch angle. This feedback error is applied to the PID control mechanism at every time instant, as derived in Eq. (6), and is responsible for generating the command that instructs the motor driver. The physical parameters play a crucial role in balancing the robot rigidly. The PID calibration (tuning the $K_{p}$, $K_{i}$, $K_{d}$ values) is performed with a trial-and-error method.

Output = K_{p}e(t) + K_{i}\int e(t)\,dt + K_{d}\frac{\mathrm{d}e(t)}{\mathrm{d}t}   (6)
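To make the control law concrete, the following Python sketch evaluates Eq. (5) and Eq. (6) in discrete time, the way the micro-controller would evaluate them in each loop iteration. The sensor readings, loop period, and filter coefficient are illustrative placeholders (the tuned gains are the ones reported in Section 3); on the actual robot this loop runs on the NodeMCU ESP32 against live MPU 6050 samples, which is not reproduced here.

# Hedged sketch: discrete-time complementary filter (Eq. 5) + PID (Eq. 6).
# Sensor values, loop period and filter coefficient are placeholders.
import math

ALPHA = 0.98          # complementary-filter coefficient (assumed)
DT = 0.01             # loop period in seconds (assumed 100 Hz)
KP, KI, KD = 1970.0, 21950.0, 19.5   # tuned gains reported in Section 3

pitch_est = 0.0       # previous pitch estimate, phi_hat(0) = 0
integral = 0.0
prev_error = 0.0
target_pitch = 0.0    # upright set-point

def control_step(gx, ay, az):
    """One control iteration: fuse the IMU readings, then compute the PID output."""
    global pitch_est, integral, prev_error
    # Eq. (5): blend gyro integration with the accelerometer tilt angle.
    # The paper writes (phi_hat + g_x); in discrete time the gyro rate is
    # scaled by the loop period DT before being added.
    accel_angle = math.degrees(math.atan2(ay, az))
    pitch_est = ALPHA * (pitch_est + gx * DT) + (1.0 - ALPHA) * accel_angle

    # Eq. (6): PID on the pitch error e(t)
    error = target_pitch - pitch_est
    integral += error * DT
    derivative = (error - prev_error) / DT
    prev_error = error
    return KP * error + KI * integral + KD * derivative  # motor command

# Example call with dummy gyro rate (deg/s) and accelerations
print(control_step(gx=1.2, ay=0.05, az=0.99))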

2.2 The Deep Reinforcement Learning Agent

The novel self-trained deep reinforcement learning agent is based on the Advantage Actor-Critic (A2C) algorithm, which consists of two interdependent and structurally similar neural networks (actor and critic) and integrates the policy-based and value-based families of RL algorithms. The actor (policy function) decides an action at each iteration, and the critic (value function) estimates the quality, or value index, of a delivered initial state.


Figure 3: The Architecture of Actor ANN (Previous State - Action)

The Artificial Neural Network (ANN) architectures of the actor and the critic are visualized in Figs. 3 and 4. Based on the system's previous state, the actor decides the next action. Similarly, based on the previous action (and its reward) and the state, the critic produces the quality value of that action. The actor learns from the temporal-difference (TD) error, calculated by the advantage function in each nested iteration. The detailed mathematical functionality of the A2C algorithm is characterised in the algorithm below, where the initial yaw and pitch values are supplied in the form of the weights $Y$ and $\theta$.


Figure 4: The Self-Trained Architecture of Critic ANN

[Algorithm: mathematical functionality of the self-trained A2C agent]
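The core of the A2C update described above (the critic estimates the state value, the TD error serves as the advantage, and the actor is pushed along the policy gradient weighted by that advantage) can be sketched as follows. This is a minimal PyTorch illustration on a single dummy transition, not the trained model reported in Section 3; the state/action dimensions, layer widths, and learning rate are assumptions made for illustration.

# Hedged sketch of one A2C update step: TD error used as the advantage.
# State/action sizes, layer widths and hyper-parameters are assumptions.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, N_ACTIONS))          # policy logits
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))                 # state value V(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

# Dummy transition (s, a, r, s', done) standing in for one simulator step
s  = torch.randn(1, STATE_DIM)
s2 = torch.randn(1, STATE_DIM)
r, done = 1.0, False

dist = torch.distributions.Categorical(logits=actor(s))
a = dist.sample()

# TD target and TD error (the advantage)
with torch.no_grad():
    td_target = r + GAMMA * critic(s2) * (0.0 if done else 1.0)
advantage = td_target - critic(s)

critic_loss = advantage.pow(2).mean()                           # value regression
actor_loss = -(dist.log_prob(a) * advantage.detach()).mean()    # policy gradient

opt.zero_grad()
(actor_loss + critic_loss).backward()
opt.step()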

3 Experimental Results

In this section, the experimental results are presented and analysed. First, the physical modelling and PID tuning are performed in MATLAB Simulink based on our system design. The RL model for the robot is designed in the PyBullet-OpenAI environment and then deployed to the micro-controller. The physical parameters of the robot are shown in Table 1.

Physical Parameter Value/Unit
Mass of main body 135 g
Mass of pendulum 60 g
Diameter of Wheel 5 cm
Distance between wheels 20 cm
Static Friction b/w Surfaces 1.15
Table 1: Physical parameters of Epersist Robot

The proposed hardware prototype is cost-effective and consists of a NodeMCU ESP32, an IMU MPU 6050, an L298N H-bridge motor driver, 6 V gear motors, and a self-designed 3D-printed base plate. The NodeMCU ESP32 acts as the micro-controller, onto which the RL model or the PID mechanism is manually uploaded. Injecting the RL model (.h5 extension) into the NodeMCU ESP32 is one of the challenging tasks and is handled carefully. The micro-controller can connect to a mobile phone via Bluetooth, and an interactive mobile application is built to provide the robot with the initial control variables (pitch, yaw), as shown in Fig. 5. The interaction between the mobile application and the robot is efficient. An Inertial Measurement Unit (6-axis IMU MPU 6050) sensor is utilised to acquire the 3-axis accelerometer and 3-axis gyroscopic measures. These six measures are used to calculate the pitch (tilt angle) and yaw (horizontal movement) of the robot.
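The paper does not spell out how the trained .h5 policy is injected into the ESP32. One plausible route, sketched below purely as an assumption, is to convert the Keras model to a TensorFlow Lite flat buffer and then embed it as a C array for execution with TensorFlow Lite Micro; the file names and the deployment toolchain here are hypothetical, not details confirmed by the authors.

# Hedged sketch: one plausible path from a Keras .h5 policy to an
# ESP32-friendly flat buffer. File names and toolchain are assumptions.
import tensorflow as tf

model = tf.keras.models.load_model("actor_policy.h5")    # hypothetical file name
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]      # shrink for the MCU
tflite_model = converter.convert()

with open("actor_policy.tflite", "wb") as f:
    f.write(tflite_model)
# The .tflite buffer could then be embedded as a C array (e.g. via `xxd -i`)
# and executed on the NodeMCU ESP32 with TensorFlow Lite Micro.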


Figure 5: Epersist Robot and the Mobile Interface

The optimal calibration points must be identified to set the target values of the control variables and sensor measures. After several experiments, the $K_{p}$, $K_{i}$ and $K_{d}$ values are tuned and calibrated to 1970, 21950, and 19.5, respectively. Furthermore, the calibrated offsets of the sensor measures are shown in Table 2.

Offset Value
X Accelerometer Offset -1780
Y Accelerometer Offset 750
Z Accelerometer Offset 2700
X Gyroscopic Offset 180
Y Gyroscopic Offset 76
Z Gyroscopic Offset 61
Table 2: Offset measures for the IMU MPU 6050 Sensor

Irrespective of the system's physical characteristics, the trained RL model can be reused for our robot. As neural networks are involved in the RL model, the training time and the time to upload the model to the NodeMCU are longer than for the PID mechanism. The description and summary of the trained RL model are shown in Table 3.

Parameter A2C
Trained Episodes 7775
Trained Steps 1.5e+06
Total Training Time 19 Hrs
Time To Upload 23.5 Secs
Maximum Possible Reward After Training 60
Maximum Achieved Reward After Training 56.42
Table 3: Summary and description of the self-trained RL model


Figure 6: Comparative assessment of the PID and RL mechanisms: Pitch vs. Time and Yaw vs. Time

After tuning the $K_{p}$, $K_{i}$, $K_{d}$ values and the hyper-parameters of both control mechanisms, their performance is analyzed with the sensor measures from the functional prototype. The comparative assessment of both control mechanisms on the hardware prototype is demonstrated in Fig. 6. The comparison between PID and RL is based on the angle of deviation (pitch) and the horizontal distance (yaw) covered by the robot over time. The final angle of deviation must be approximately zero for the system to be considered balanced. In our experiments, we observed that the robot's overall movement with PID was not as smooth as with RL: the angle of deviation (pitch) of the PID-assisted robot oscillates more than that of the RL-assisted robot, and as the number of oscillations per unit time increases, the smoothness of the motion decreases. The mean settling time of the PID robot is 7 seconds, whereas the RL robot takes 5 seconds to achieve maximum stability. Furthermore, the RL robot covered less distance than the PID robot while reaching maximum stability; the difference in distance covered between the two control mechanisms is approximately 20 cm. To conclude, the RL control mechanism is better and more effective than the PID control mechanism in terms of mean settling time and distance travelled within the settling time.

4 Conclusion

The proposed TWSBR "Epersist" is a complete end-to-end framework for studying advanced control techniques through the complicated task of balancing the structure. The framework has advantages over existing frameworks in terms of the time complexity of the self-trained RL agent and the cost efficiency of the robot. The physical modelling of the system is appropriate, and its outcomes match the experimental results. The framework is first analyzed via simulation (MATLAB for PID, OpenAI for RL), and hardware prototyping is then performed with the NodeMCU ESP32 micro-controller and the IMU MPU 6050 sensor. The PID and RL mechanisms are analyzed from the initial stages based on the physical system design, and the performance and experimental results of both control mechanisms are presented. The utility of the robot is simple thanks to the interactive mobile interface connected over Bluetooth. The current robot is limited to an arbitrary path and cannot follow a desired path; path following and more advanced RL control agents can be included in future work.

References

  • [1] N. T and P. K T, "PID Controller Based Two Wheeled Self Balancing Robot," 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI), 2021, pp. 1-4, doi: 10.1109/ICOEI51242.2021.9453091.
  • [2] E. Susanto, A. Surya Wibowo and E. Ghiffary Rachman, "Fuzzy Swing Up Control and Optimal State Feedback Stabilization for Self-Erecting Inverted Pendulum," in IEEE Access, vol. 8, pp. 6496-6504, 2020, doi: 10.1109/ACCESS.2019.2963399.
  • [3] J. Huang, M. Zhang, S. Ri, C. Xiong, Z. Li and Y. Kang, "High-Order Disturbance-Observer-Based Sliding Mode Control for Mobile Wheeled Inverted Pendulum Systems," in IEEE Transactions on Industrial Electronics, vol. 67, no. 3, pp. 2030-2041, March 2020, doi: 10.1109/TIE.2019.2903778.
  • [4] S. Kim and S. Kwon, "Nonlinear Optimal Control Design for Underactuated Two-Wheeled Inverted Pendulum Mobile Platform," in IEEE/ASME Transactions on Mechatronics, vol. 22, no. 6, pp. 2803-2808, Dec. 2017, doi: 10.1109/TMECH.2017.2767085.
  • [5] M. S. Mahmoud and M. T. Nasir, "Robust control design of wheeled inverted pendulum assistant robot," in IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 4, pp. 628-638, 2017, doi: 10.1109/JAS.2017.7510613.
  • [6] C. Chang and S. Chang, "Using Reinforcement Learning to Achieve Two Wheeled Self Balancing Control," 2016 International Computer Symposium (ICS), 2016, pp. 104-107, doi: 10.1109/ICS.2016.0029.
  • [7] L. Guo, S. A. A. Rizvi and Z. Lin, "Optimal Control of a Two-Wheeled Self-Balancing Robot by Reinforcement Q-learning," 2020 IEEE 16th International Conference on Control & Automation (ICCA), 2020, pp. 955-960, doi: 10.1109/ICCA51439.2020.9264485.
  • [8] G. Paczolay and I. Harmati, "A New Advantage Actor-Critic Algorithm For Multi-Agent Environments," 2020 23rd International Symposium on Measurement and Control in Robotics (ISMCR), 2020, pp. 1-6, doi: 10.1109/ISMCR51255.2020.9263738.
  • [9] Y. Gu, Y. Cheng, C. L. P. Chen and X. Wang, "Proximal Policy Optimization With Policy Feedback," in IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 52, no. 7, pp. 4600-4610, July 2022, doi: 10.1109/TSMC.2021.3098451.
  • [10] S. Sharma, "SAC-RL: Continuous Control of Wheeled Mobile Robot for Navigation in a Dynamic Environment," 2020.
  • [11] Jiann-Shiun Lew and K. B. Lim, "Robust control of identified reduced-interval transfer function," in IEEE Transactions on Control Systems Technology, vol. 8, no. 5, pp. 833-841, Sept. 2000, doi: 10.1109/87.865855.
  • [12] C. Song, W. Ji, X. Gong and Z. Hu, "Research on stability for linear control system with time delay," 2011 IEEE International Conference on Mechatronics and Automation, 2011, pp. 2428-2432, doi: 10.1109/ICMA.2011.5986332.
  • [13] E. Li et al., "Model Learning for Two-Wheeled Robot Self-Balance Control," 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), 2019, pp. 1582-1587, doi: 10.1109/ROBIO49542.2019.8961382.
  • [14] F. F. Rabbany, A. Qurthobi and A. Suhendi, "Design of Self-Balancing Virtual Reality Robot Using PID Control Method and Complementary Filter," 2021 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT), 2021, pp. 15-19, doi: 10.1109/IAICT52856.2021.9532576.
  • [15] A. S. Shekhawat and Y. Rohilla, "Design and Control of Two-wheeled Self-Balancing Robot using Arduino," 2020 International Conference on Smart Electronics and Communication (ICOSEC), 2020, pp. 1025-1030, doi: 10.1109/ICOSEC49089.2020.9215421.