
A Comparison of Reward Functions in Q-Learning Applied to a Cart Position Problem

Amartya Mukherjee1
1Department of Applied Mathematics, University of Waterloo
(May 2021)
Abstract

Growing advances in reinforcement learning have led to advances in control theory. Reinforcement learning has effectively solved the inverted pendulum problem [1] and, more recently, the double inverted pendulum problem [2]. In reinforcement learning, an agent learns by interacting with the control system with the goal of maximizing rewards. In this paper, we explore three such reward functions in the cart position problem. This paper concludes that a discontinuous reward function, which gives non-zero rewards to agents only if they are within a given distance from the desired position, gives the best results.

Keywords: cart position problem, control theory, reinforcement learning, reward function

1 Introduction

Reinforcement Learning (RL) is a branch of Machine Learning that deals with agents that learn from interacting with an environment. This is inspired by the trial-and-error method of learning studied in psychology [1]. The goal of any RL problem is to maximize the total reward, and the reward is calculated using a reward function. At any point, the goal of an RL agent is to choose an action that maximizes not only its immediate reward, but also the total reward it expects to get if it follows a certain sequence of actions. The RL algorithm used in this paper is Q-Learning. It is a basic RL algorithm, and the mathematics behind it is easy to understand for a reader without much machine learning background. It also comes with the merit that training a Q-Learning agent is significantly faster than training agents with other commonly used RL algorithms.

The cart position problem is a toy problem that will be explored in this paper. The goal is to move a cart from a position $x=0$ to a position $x=r$. This will be done by adjusting the voltage of the cart at every time step. There already exist theoretical input functions that solve the cart position problem, so it will be interesting to see how RL compares with these functions.

The objective of this paper is to compare three different reward functions based on the performance of the RL agent. The comparison is done in three ways. First, we see which RL agent reaches steady-state motion the earliest. Second, we see which RL agent has the lowest variability in its steady-state motion. Third, we see which RL agent is the closest to $x=r$ in its steady-state motion.

This paper is organized as follows. In section 2, we describe the cart position problem. In section 3, we present the RL algorithm and the reward functions we intend to use. In section 4, the theoretical solution to the cart position problem is explained. In section 5, we present the results of the RL algorithm and compare them. In section 6, we discuss the applicability of the RL algorithm in real life situations.

2 Cart Position Problem

This paper concerns the control of the position of a cart by a rotary motor. Let $x(t)$ be the position of the cart at time $t$ and $V(t)$ the motor voltage at time $t$. $\eta_{g}$ is the gearbox efficiency, $K_{g}$ the gearbox gear ratio, $\eta_{m}$ the motor efficiency, $K_{t}$ the motor torque constant, $r_{mp}$ the motor pinion radius, $K_{m}$ the back-EMF constant, $R_{m}$ the motor armature resistance, $\nu$ the coefficient of viscous friction, and $M$ the mass of the cart. These are all constants. The equation for this control system is a second-order differential equation shown in equation 1 below:

M\ddot{x}(t)=\frac{\eta_{g}K_{g}\eta_{m}K_{t}(r_{mp}V(t)-K_{g}K_{m}\dot{x}(t))}{R_{m}r_{mp}^{2}}-\nu\dot{x}(t) \qquad (1)

Throughout this paper, we define the constants $\alpha$ and $\beta$ as:

\alpha=-\frac{\eta_{g}K_{g}^{2}\eta_{m}K_{t}K_{m}}{MR_{m}r_{mp}^{2}}-\frac{\nu}{M}
\beta=\frac{\eta_{g}K_{g}\eta_{m}K_{t}r_{mp}}{MR_{m}r_{mp}^{2}}

This simplifies our governing equation to equation 2 below:

\ddot{x}(t)=\alpha\dot{x}(t)+\beta V(t) \qquad (2)

Where $\alpha<0$ and $\beta>0$. In the absence of any control input $V(t)$, the solution to the differential equation is:

x(t)=C_{1}e^{\alpha t}+C_{2}

In this model, $\alpha<0$, so $\lim_{t\to\infty}x(t)=C_{2}$. The steady-state position is $C_{2}$. The goal of the input function $V(t)$ is to change the steady-state position to a position of our choice.

We can express our governing equation as a system of first-order differential equations. Let $s(t)=\dot{x}(t)$ represent the velocity of the cart. The system of first-order equations is expressed below:

\dot{x}(t)=s(t)
\dot{s}(t)=\alpha s(t)+\beta V(t)

Let $X(t)=\begin{bmatrix}x(t)\\ s(t)\end{bmatrix}$. This system of first-order equations can be expressed in matrix form, as shown below:

\dot{X}(t)=\begin{bmatrix}0&1\\ 0&\alpha\end{bmatrix}X(t)+\begin{bmatrix}0\\ \beta\end{bmatrix}V(t)

We also know that the objective of our control problem is to drive the cart to a particular position. For this reason, we want our output $y(t)$ to be the position $x(t)$. In matrix form involving $X(t)$, this can be written as:

y(t)=\begin{bmatrix}1&0\end{bmatrix}X(t)

The state-space form of our equation is:

\dot{X}(t)=\begin{bmatrix}0&1\\ 0&\alpha\end{bmatrix}X(t)+\begin{bmatrix}0\\ \beta\end{bmatrix}V(t) \qquad (3a)
y(t)=\begin{bmatrix}1&0\end{bmatrix}X(t)+[0]V(t) \qquad (3b)

Where $X(t)$ is the state, $V(t)$ is the input, and $y(t)$ is the output. The objective of the cart position problem is to find an input function $V(t)$ so that the steady-state value of $x$ is $r$. This will be done using RL.

3 Use of RL in the Cart Position Problem

The initial conditions used in the control system are $x(0)=\dot{x}(0)=0$. In this paper, we let $\alpha=-1$, $\beta=10$, and $r=10$. The solution to Equation 3a is computed numerically using Euler's method with a step-size of $\delta t=0.2$ s.

At each time step, the RL agent takes the output of the control system, which is the position $x(t)$ of the cart, and returns the voltage $V(t)$ that should be used as the input to the control system at that time step. The voltage $V(t)$ is an integer between $-5$ and $5$. In order to ensure that the cart does not go too far from $x=r$, the bounds of $x$ have been set to $[-10,20]$.

The training process involves running 100 samples. Each sample has 30 rounds. When a round starts, the cart is at position $x=0$. The round ends if the cart goes out of bounds (i.e. $x\not\in[-10,20]$) or 50 time steps have passed since the start of the round. This means the training process of the RL agent involves observing at most 150000 time steps.
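To make this setup concrete, the sketch below shows one way the dynamics and the episode termination could be implemented with Euler's method. It is a minimal sketch assuming the parameter values above; the class and variable names are illustrative and not taken from the code in [7].

# Assumed parameter values from the paper: alpha = -1, beta = 10, r = 10,
# step size dt = 0.2 s, bounds [-10, 20], at most 50 time steps per round.
ALPHA, BETA, R, DT = -1.0, 10.0, 10.0, 0.2
X_MIN, X_MAX, MAX_STEPS = -10.0, 20.0, 50

class CartEnv:
    def reset(self):
        self.x, self.s, self.t = 0.0, 0.0, 0   # x(0) = x_dot(0) = 0
        return self.x

    def step(self, V):
        # One Euler step of x_ddot = alpha * x_dot + beta * V
        self.x += DT * self.s
        self.s += DT * (ALPHA * self.s + BETA * V)
        self.t += 1
        done = not (X_MIN <= self.x <= X_MAX) or self.t >= MAX_STEPS
        return self.x, done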

The objective of the RL agent is to find a sequence of voltage inputs such that the total reward is maximized.

3.1 Q-Learning

The RL algorithm used in this paper is Q-Learning. Q-Learning was first introduced by Watkins (1989) in his PhD thesis [3]. It finds an optimal policy $Q(s,a)$, where $s$ is the state and $a$ is the action. This optimal policy maximizes the total reward in the system. $Q(s,a)$ can be thought of as a table of values for all states and actions.

At first, $Q(s,a)$ is initialized to 0 for all $s,a$. At every time step $t$, the action $a_{t}$ is determined from the state $s_{t}$ through an $\epsilon$-greedy process: a random draw decides whether the agent exploits the current policy or explores a random action. When the training of the agent starts, we let $\epsilon=1$. At every time step, we take a random sample from the $\mathrm{Uniform}(0,1)$ distribution. If the random sample is greater than $\epsilon$, then $a_{t}=\operatorname{arg\,max}_{a}(Q(s_{t},a))$. Otherwise, $a_{t}$ is a randomly selected action.
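A minimal sketch of this $\epsilon$-greedy selection is given below, assuming the Q-table is stored as a Python dictionary keyed by (state, action) pairs; this layout is our assumption and may differ from the one used in Lin's code [4].

import random

ACTIONS = list(range(-5, 6))  # integer voltages from -5 to 5

def epsilon_greedy(Q, state, epsilon):
    # Exploit with probability 1 - epsilon, otherwise explore with a random voltage.
    if random.uniform(0, 1) > epsilon:
        return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
    return random.choice(ACTIONS)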

Let $Q^{[n]}(s,a)$ be the policy after $n$ updates. When $s_{t+1}$ is calculated using $s_{t}$ and $a_{t}$, then $Q^{[n]}(s_{t},a_{t})$ is updated using the following equation:

Q^{[n+1]}(s_{t},a_{t})=Q^{[n]}(s_{t},a_{t})+\zeta(r_{t}+\gamma\max_{a}Q^{[n]}(s_{t+1},a)-Q^{[n]}(s_{t},a_{t})) \qquad (4)

Where $r_{t}$ is the reward and $\zeta$ is the learning rate. $\max_{a}Q^{[n]}(s_{t+1},a)$ is an estimate of the future reward, and $\gamma$ is a measure of the weight we give to future rewards. $\zeta$ and $\gamma$ are hyperparameters of this agent. The hyperparameters used in this paper are $\zeta=0.05$ and $\gamma=0.9$.
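As a sketch, the update in Equation 4 can be written as follows, reusing the dictionary-based Q-table and the ACTIONS list from the previous sketch (again an assumed layout, not necessarily that of [4] or [7]):

ZETA, GAMMA = 0.05, 0.9  # learning rate and discount factor used in this paper

def q_update(Q, s_t, a_t, reward, s_next):
    # Equation 4: Q <- Q + zeta * (r_t + gamma * max_a Q(s_next, a) - Q(s_t, a_t))
    best_next = max(Q.get((s_next, a), 0.0) for a in ACTIONS)
    old = Q.get((s_t, a_t), 0.0)
    Q[(s_t, a_t)] = old + ZETA * (reward + GAMMA * best_next - old)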

In $Q(s,a)$, the state $s$ refers to the position $x$ of the cart, and the action $a$ refers to the voltage $V$ of the cart.

The Q-Learning code used in this paper is taken from Lin's (2018) GitHub repository. This code was originally used to solve the Blackjack problem using Q-Learning. The equation for updating $\epsilon$ after every time step $t$ according to Lin's code is given by the equation below [4]:

\epsilon_{t+1}=\begin{cases}\epsilon_{t}-\epsilon_{0}/(3n_{train})&n_{train}-t>0.7n_{train}\text{ or }0.3n_{train}>n_{train}-t>0\\ \epsilon_{t}-2\epsilon_{0}/n_{train}&0.7n_{train}>n_{train}-t>0.3n_{train}\\ 0&\text{otherwise}\end{cases} \qquad (5)

Where $n_{train}=30000$ is the number of time steps used to train the agent.
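A sketch of this schedule, as we read Equation 5 (the variable names are ours, not Lin's):

N_TRAIN = 30000   # number of training time steps
EPS_0 = 1.0       # initial value of epsilon

def next_epsilon(eps_t, t):
    # Equation 5, written in terms of the remaining training steps n_train - t
    remaining = N_TRAIN - t
    if remaining > 0.7 * N_TRAIN or 0 < remaining < 0.3 * N_TRAIN:
        return eps_t - EPS_0 / (3 * N_TRAIN)
    if 0.3 * N_TRAIN < remaining < 0.7 * N_TRAIN:
        return eps_t - 2 * EPS_0 / N_TRAIN
    return 0.0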

3.2 Reward Functions

As shown previously, reward functions are used to update the policy function at each time step with the intention that the RL agent maximizes the total reward. This paper will explore three reward functions. The first reward function takes the negative of the square of the distance between $x$ and $r$, with the intention that the RL agent will be rewarded better if it is closer to $r$.

R(x)=-(x-r)^{2} \qquad (6)

The second reward function is piecewise and non-negative. This function is linear instead of quadratic.

R(x)=\begin{cases}\frac{r}{2}-|x-r|&\frac{r}{2}<x<\frac{3r}{2}\\ 0&\text{otherwise}\end{cases} \qquad (7)

The third reward function is discontinuous. It rewards a 1 if the cart is within 1 unit of $r$ but at least 0.1 units away, and a 5 if the cart is less than 0.1 units away from $r$.

R(x)=\begin{cases}1&x\in[r-1,r-0.1]\cup[r+0.1,r+1]\\ 5&x\in(r-0.1,r+0.1)\\ 0&\text{otherwise}\end{cases} \qquad (8)

The code used for this problem is provided so that our results can be reproduced [7].
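For concreteness, the three reward functions can be written directly from Equations 6, 7, and 8 with $r=10$; the sketch below is ours, and the code in [7] may organize them differently.

R_TARGET = 10.0  # desired position r

def reward_1(x):
    # Equation 6: negative squared distance from r
    return -(x - R_TARGET) ** 2

def reward_2(x):
    # Equation 7: positive linear reward inside (r/2, 3r/2), zero elsewhere
    return R_TARGET / 2 - abs(x - R_TARGET) if R_TARGET / 2 < x < 3 * R_TARGET / 2 else 0.0

def reward_3(x):
    # Equation 8: 5 within 0.1 units of r, 1 within 1 unit of r, zero elsewhere
    d = abs(x - R_TARGET)
    if d < 0.1:
        return 5.0
    if d <= 1.0:
        return 1.0
    return 0.0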

4 Theoretical solution

Consider the following voltage function:

V(t)=K_{p}(r-x(t)) \qquad (9)

Where $K_{p}>0$. Substituting the voltage function into equation 2 gives us the following:

\ddot{x}(t)=\alpha\dot{x}(t)+\beta K_{p}(r-x(t)) \qquad (10)

This equation gives us a solution $x(t)$ whose steady-state position is $x=r$ [6]. The proof is given in Appendix A. Numerical simulations of Equation 10 have been done using Euler's method with $K_{p}=0.2$ and $K_{p}=0.1$. The plots of the two trajectories are given below:

Figure 1: Trajectory of the cart using the theoretical solution with $K_{p}=0.2$
Figure 2: Trajectory of the cart using the theoretical solution with $K_{p}=0.1$

With $K_{p}=0.2$, the position of the cart is within $[r-1,r+1]$ after 35 time steps (or 7 seconds). With $K_{p}=0.1$, the position of the cart is within $[r-1,r+1]$ after 23 time steps (or 4.6 seconds).
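These trajectories can be reproduced with a few lines of Euler integration; the sketch below assumes the same $\alpha=-1$, $\beta=10$, $r=10$, and $\delta t=0.2$ s used elsewhere in the paper.

def simulate_p_controller(Kp, alpha=-1.0, beta=10.0, r=10.0, dt=0.2, n_steps=50):
    # Closed-loop Euler simulation of x_ddot = alpha * x_dot + beta * Kp * (r - x) (Equation 10)
    x, s = 0.0, 0.0
    trajectory = [x]
    for _ in range(n_steps):
        V = Kp * (r - x)            # Equation 9
        x += dt * s
        s += dt * (alpha * s + beta * V)
        trajectory.append(x)
    return trajectory

traj = simulate_p_controller(Kp=0.2)   # compare with Figure 1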

This theoretical solution will be compared to the solution derived from Q-Learning.

5 Results of Q-Learning

In this section, we train three RL agents using each of the reward functions provided in section 3.2. We then compare the results of each of our RL agents.

5.1 Reward Function 1

The first reward function takes the negative of the square of the distance between $x$ and $r$ with the intention that the RL agent will be rewarded better if it is closer to $r$, as shown in Equation 6. Figure 3 below shows how the average reward varies with each sample. The average reward is calculated by taking the mean of the total reward of each of the trajectories in the sample.

Figure 3: Average reward plot for reward function 1

It is clear that, after training for 70 samples, the training is complete and the average reward is approximately $-468$. Figure 4 shows the trajectory of each cart in sample 1. These carts follow a random motion. All trajectories of carts in later samples will be compared to this plot.

Figure 4: Trajectory of each cart at sample 1 for reward function 1

In this graph, most of the carts go out of bounds within the first 10 time steps. During the first sample, the value of $\epsilon$ used in the $\epsilon$-greedy algorithm is almost 1, which is why the voltage applied at each time step is picked randomly. Thus, the trajectory that each cart takes is random.

At samples 60 to 70, the average reward plot in Figure 3 shows an increasing trend. Figure 5 shows the trajectory of each cart at sample 65.

Figure 5: Trajectory of each cart at sample 65 for reward function 1

In this plot, the RL agent has clearly learned that the cart needs to move towards $x=r$ at the start of the round. The trajectories are less likely to move out of bounds, and several trajectories appear to oscillate about $x=r$. This is a significant improvement compared to Figure 4.

Figure 6 shows the trajectory of each cart in sample 100.

Figure 6: Trajectory of each cart at sample 100 for reward function 1

This graph shows that the cart reaches a steady-state motion after 5 time steps (or 1.0 seconds). The position alternates between $x=10.0$ and $x=8.0$ and the voltage alternates between $V=-1$ and $V=1$. The total reward in this trajectory is $-468$, which is significantly higher than the average rewards in the first few samples shown in Figure 3.

The steady-state position is within $\pm 2$ of $r$. If $V=0$ at $x=10$, then the cart is expected to drift past $x=r$, which is why $V=-1$ is the best action to take here.

5.2 Reward Function 2

The second reward function gives a positive reward only if the position of the cart is in the range $[\frac{r}{2},\frac{3r}{2}]=[5,15]$, as shown in Equation 7. Unlike reward functions 1 and 3, the step-size used here is 0.1 s because it leads to better results, and the number of time steps in a round is still 50. Figure 7 below shows how the average reward varies with each sample.

Figure 7: Average reward plot for reward function 2

It is clear that, after training for 40 samples, each trajectory of the cart has a total reward of 86. Figure 8 shows the trajectory of each cart in sample 1. These carts follow a random motion. All trajectories of carts in later samples will be compared to this plot. The distance from the origin is the value $x(t)$ at time $t$ and the time step is $t/\delta t$.

Figure 8: Trajectory of each cart at sample 1 for reward function 2

In this graph, most of the carts go out of bounds within the first 40 time steps. This makes sense since the time step size used here is half the step size used for the first reward function. For this reason, the trajectories shown in this graph are different from the trajectories shown in Figure 4. The trajectory that each cart takes is random because the voltage applied at each time step is picked randomly.

At samples 20 to 30, the average reward in Figure 7 shows an increasing trend. Figure 9 shows the trajectory of each cart in sample 25.

Figure 9: Trajectory of each cart at sample 25 for reward function 2

In this plot, it is clear that the agent has learned to move the cart towards $x=r$ at the start of the round. Several trajectories appear to stay near $x=r$ and the cart is less likely to go out of bounds. This is a significant improvement compared to Figure 8.

Figure 10 shows the trajectory of each cart in sample 100.

Figure 10: Trajectory of each cart at sample 100 for reward function 2

This graph shows that the cart reaches a steady-state motion after 10 time steps (or 1.0 seconds). The validation plot is the same as this plot. The position alternates between $x=12.0$, $x=13.0$ and $x=17.0$, and the voltage alternates between $V=1$, $V=4$ and $V=-5$. The total reward in this trajectory is 86.0, which is significantly higher than the average rewards in the first few samples shown in Figure 7.

However, the variation in the position during the steady-state motion is too large. The steady-state position is only within $\pm 7$ of $r$, and $r$ is not in the range of positions in the steady-state motion. This is because the agent has learned that a positive reward comes if the cart is in the range $[5,15]$. Thus, there is less of an incentive to stay closer to $x=r$ since the current trajectory already significantly maximizes the total reward. This shows that the second reward function is not as reliable as the first reward function.

5.3 Reward Function 3

To ensure that the variability of the steady-state position is minimized, the third reward function gives a positive reward only if the position of the cart is in the range $[r-1,r+1]=[9,11]$, as shown in Equation 8. Figure 11 shows how the average reward varies with each sample.

Figure 11: Average reward plot for reward function 3

It is clear that, after training for 85 samples, each trajectory of the cart has a total reward of 230. Figure 12 shows the trajectory of each cart in sample 1. These carts follow a random motion. All trajectories of carts in later samples will be compared to this plot.

Figure 12: Trajectory of each cart at sample 1 for reward function 3

In this graph, most of the carts go out of bounds within the first 10 time steps. This graph shows similar trends to Figure 4. The direction each cart goes at the start of the round is random as the voltage applied at each time step is picked randomly.

At samples 80 to 85, the average reward in Figure 11 shows an increasing trend. Figure 13 shows the trajectory of each cart in sample 84.

Figure 13: Trajectory of each cart at sample 84 for reward function 3

In this plot, it is clear that the agent has learned to move the cart towards $x=r$ at the start of the round. Trajectories are less likely to go out of bounds. In some trajectories, the cart moves back and forth across the interval $[r-1,r+1]$ multiple times in order to increase its total reward.

Figure 14 shows the trajectory of each cart in sample 100.

Figure 14: Trajectory of each cart at sample 100 for reward function 3

This graph shows that the cart reaches exactly $x=r$ after 4 time steps (0.8 seconds) and stays there. The validation plot is the same as this plot. The steady-state motion has no variability, unlike that obtained with the first and second reward functions. On the other hand, a drawback of using this reward function is that training this agent takes more samples than training RL agents that use the first or second reward function, as shown in Figure 11.

We first compare which RL agent reaches steady-state motion the earliest. For the first and second reward functions, the Q-Learning agents take equally long to reach steady-state motion (1.0 s). The Q-Learning agent using the third reward function reaches steady-state motion the quickest (0.8 s).

We then compare which RL agent has the lowest variability in its steady-state motion. With the first reward function, the steady-state motion alternates between positions 8.0 and 10.0, thus having a width of 2.0. With the second reward function, the steady-state motion alternates between positions 12.0, 13.0 and 17.0, thus having a width of 5.0. With the third reward function, the steady-state motion only takes the value 10.0, thus having a width of 0.0. This shows that, while the first reward function leads to a smaller width than the second reward function, the third reward function has the smallest width.

Lastly, we compare which RL agent is closest to $x=r$ in its steady-state motion. Clearly, the third reward function performs the best since its steady-state position is exactly $r$, and the second reward function performs the worst since $r$ is not in its set of steady-state positions.

Thus, the third reward function gives the best results.

6 Discussion

While the previous section discussed the steady-state behaviour of the Q-Learning agent for each of the reward functions, this section will discuss the practicality of Q-Learning agents on real carts based on the results of the third reward function.

We first compare the trajectory of the Q-Learning agent shown in Figure 14 with the trajectories of the theoretical solutions shown in Figures 1 and 2. It is clear that the Q-Learning agent shows better performance than the theoretical solutions, as the Q-Learning agent reached steady-state motion within 0.8 seconds while the theoretical solutions reached a position within $\pm 1$ of $x=r$ within 7 seconds and 4.6 seconds.

6.1 Applicability of Q-Learning on real carts

In this paper, the cart position problem is treated as a toy problem where Q-Learning may be useful. Suppose a real cart follows the model shown in Equation 1 and its goal is to get from a position $x=0$ to $x=r$. Its position is updated continuously, as opposed to the discrete time steps used in Euler's method. In Q-Learning, the policy function $Q(s,a)$ picks actions based on trial-and-error, and the set of states $s$ in $Q(s,a)$ is finite and discrete. Thus, Q-Learning is impractical on a real cart where the position can be any real number. In this situation, the theoretical solution is more effective as it adjusts the voltage based on a continuous set of positions.

Having a discrete set of actions can also be undesirable in this problem. In the situation with reward function 1 (Figure 6), it is clear that $V$ has to alternate between $-1$ and $1$ in order to maintain a steady-state motion. If $V=0$ at $x=10$, then the cart will drift away from $x=10$ since its velocity is non-zero. Thus, further areas of exploration involve adding values of $V$ with a smaller magnitude (e.g. 0.1, 0.2) into the range of actions the Q-Learning agent can take.

Further areas of exploration also involve using Deep Q-Learning. Artificial neural networks (ANNs) are trained by modifying weights rather than entries in a $Q(s,a)$ table; thus, ANNs may be more reliable in problems involving continuous states or continuous actions.

Consider the trajectory of the Q-Learning agent shown in Figure 14. Between time $t=0$ and $t=\delta t=0.2$, the velocity of the cart increases from $\dot{x}=0\ \mathrm{m\,s^{-1}}$ to $\dot{x}=40\ \mathrm{m\,s^{-1}}$. This means the acceleration is approximately $\ddot{x}=200\ \mathrm{m\,s^{-2}}$. In a real cart, this could be dangerously high, given that cars with accelerations of approximately $10\ \mathrm{m\,s^{-2}}$ to $20\ \mathrm{m\,s^{-2}}$ are considered among the fastest-accelerating cars [5]. This shows that the Q-Learning agent is impractical on real carts.
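Explicitly, this acceleration estimate is a forward difference over the first time step:

\ddot{x}(0)\approx\frac{\dot{x}(\delta t)-\dot{x}(0)}{\delta t}=\frac{40-0}{0.2}=200\ \mathrm{m\,s^{-2}}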

Lastly, with real-world carts, there is always an error associated with measuring the mass of the cart, the coefficient of friction, or any of the constants described in Equation 1. This means there can be an error associated with measuring $\alpha$ or $\beta$. Suppose we trained a Q-Learning agent with $\alpha=-1$ and we intend to test it on a cart with $\alpha=-1.01$. Figure 15 shows what the trajectory will look like.

Figure 15: Trajectory of a cart with $\alpha=-1.01$

If we use a cart with $\alpha=-1$, we get the trajectory shown in Figure 14. If we use a cart with $\alpha=-1.01$, we get the trajectory shown in Figure 15. This shows that a small error in the measurement of $\alpha$ can lead to a significant change in the trajectory of the cart. This occurs because the set of states $s$ in $Q(s,a)$ is finite and discrete: a small change in $\alpha$ means that the cart will reach positions that are either not in the set of states in $Q(s,a)$ or have a different action associated with them. This again shows that Q-Learning is not reliable for real carts.
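One way to perform this robustness check is to roll out the trained greedy policy in an environment with a perturbed $\alpha$, as in the sketch below. The rounding of positions to Q-table states is our assumption, since the exact discretization is determined by the code in [7]; ACTIONS and the dictionary-based Q-table are reused from the sketches in Section 3.

def rollout_greedy(Q, alpha, beta=10.0, r=10.0, dt=0.2, n_steps=50):
    # Evaluate the trained Q-table greedily (epsilon = 0) under a possibly perturbed alpha.
    x, s = 0.0, 0.0
    trajectory = [x]
    for _ in range(n_steps):
        state = round(x, 1)                          # assumed state discretization
        V = max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
        x += dt * s
        s += dt * (alpha * s + beta * V)
        trajectory.append(x)
    return trajectory

nominal = rollout_greedy(Q, alpha=-1.0)    # compare with Figure 14
perturbed = rollout_greedy(Q, alpha=-1.01) # compare with Figure 15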

7 Conclusion

This paper compared three different reward functions based on the performance of a Q-Learning agent applied to the cart position problem. In conclusion, a discontinuous reward function that rewards the RL agent only if the position of the cart lies in $[r-1,r+1]$ gives the best results. Through this analysis, we also created an RL agent that outperforms the theoretical solutions.

Acknowledgements

The author gratefully acknowledges Jun Liu and Milad Farsi for their continued support, feedback and suggestions on this paper.

Appendix A: Proof of the theoretical solution

Consider the following input function:

V(t)=K_{p}(r-x(t))

Substituting the voltage function into equation 2 gives us the following:

\ddot{x}(t)=\alpha\dot{x}(t)+\beta K_{p}(r-x(t))

Rearranging this gives us a non-homogeneous second-order differential equation:

\ddot{x}(t)-\alpha\dot{x}(t)+\beta K_{p}x(t)=\beta K_{p}r \qquad (11)

The goal of this appendix is to find the steady-state position $\lim_{t\rightarrow\infty}x(t)$. This involves finding a general solution to equation 11. This will be done by finding the general solution to the homogeneous version of the equation, $x_{h}(t)$, and a particular solution to the non-homogeneous equation, $x_{p}(t)$.

A.1: Solve the homogeneous equation

\ddot{x}_{h}(t)-\alpha\dot{x}_{h}(t)+\beta K_{p}x_{h}(t)=0

Writing this in matrix form with $X_{h}(t)=\begin{bmatrix}x_{h}(t)\\ \dot{x}_{h}(t)\end{bmatrix}$ gives the following equation:

\dot{X_{h}}(t)=\begin{bmatrix}0&1\\ -\beta K_{p}&\alpha\end{bmatrix}X_{h}(t)=:A_{h}X_{h}(t)

The solution to this equation is:

X_{h}(t)=\exp(A_{h}t)X_{h}(0)

The eigenvalues of $A_{h}$ are:

\frac{\alpha}{2}\pm\sqrt{\frac{\alpha^{2}}{4}-\beta K_{p}}

Since $\beta>0$ and $K_{p}>0$, this means:

\mathrm{Re}\left(\sqrt{\frac{\alpha^{2}}{4}-\beta K_{p}}\right)<\left|\frac{\alpha}{2}\right|

And since $\alpha<0$, this means:

\mathrm{Re}\left(\sqrt{\frac{\alpha^{2}}{4}-\beta K_{p}}\right)<-\frac{\alpha}{2}
\mathrm{Re}\left(\pm\sqrt{\frac{\alpha^{2}}{4}-\beta K_{p}}\right)<-\frac{\alpha}{2}
\mathrm{Re}\left(\frac{\alpha}{2}\pm\sqrt{\frac{\alpha^{2}}{4}-\beta K_{p}}\right)<0
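For instance, with the values used in this paper ($\alpha=-1$, $\beta=10$, $K_{p}=0.2$), the eigenvalues are

-\frac{1}{2}\pm\sqrt{\frac{1}{4}-2}=-0.5\pm i\sqrt{1.75}\approx-0.5\pm 1.32i

both of which have negative real part.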

This shows that $A_{h}$ is Hurwitz, and by extension:

\lim_{t\rightarrow\infty}X_{h}(t)=\begin{bmatrix}0\\ 0\end{bmatrix}

Thus, the steady-state value of the homogeneous equation is $x_{h}=0$.

A.2: Find a particular solution to the non-homogeneous equation

Consider:

x_{p}(t)=r

Substituting it into equation 11 gives:

\ddot{x}(t)-\alpha\dot{x}(t)+\beta K_{p}x(t)=0+0+\beta K_{p}r=\beta K_{p}r

Thus, $x_{p}(t)=r$ is a particular solution to the non-homogeneous equation. The steady-state value of $x_{p}(t)$ is $r$.

A.3: Steady-state value of equation 11

All solutions of the homogeneous equation decay to 0, and the steady-state value of the particular solution of the non-homogeneous equation is $r$. This means the steady-state value of equation 11 is $r$.

\lim_{t\rightarrow\infty}x(t)=r

This shows that using the input function $V(t)=K_{p}(r-x(t))$ is effective in moving the cart from position $x=0$ to position $x=r$.

References

  • [1] Sutton RS, Barto AG. (1998). Reinforcement Learning: An Introduction. MIT Press.
  • [2] Zheng Y, Luo S, Lv Z. (2006). Control Double Inverted Pendulum by Reinforcement Learning with Double CMAC Network. Proceedings - International Conference on Pattern Recognition, 4, 639-642. doi:10.1109/ICPR.2006.416.
  • [3] Watkins, C.J.C.H. (1989), Learning from Delayed Rewards (Ph.D. thesis), Cambridge University
  • [4] Lin C (2018), Blackjack–Reinforcement-Learning, GitHub Repository https://github.com/ml874/Blackjack–Reinforcement-Learning
  • [5] Autocar (2020), The top fastest-accelerating cars in the world 2021, https://www.autocar.co.uk/car-news/best-cars/top-fastest-accelerating-cars-world
  • [6] Farsi, M (2021), Lab 1: Cart Position, AMATH 455/655, University of Waterloo
  • [7] Mukherjee, A (2021), Q Learning Cart Problem, GitHub Repository https://github.com/amartyamukherjee/ReinforcementLearningCartPosition