Q-Learning with Python

Currently, I am working on learning algorithms in Data Science for robotics. Reading many examples online and trying them on my own gives me a feeling of reward. I got deeply fascinated by Q learning algorithm based on the Bellmans equation. I also made a Pong game using Q learning. You can view that project on my instructable.

It didn' take much time to understand the working of Q learning. It appeared similar to the State Space matrix that I studied in my Control Systems class in college which I have forgotten now. However, seeing a practical application makes it easier to learn.

Q-Learning is based on State-Action-Reward strategy. For example, every state has various actions that can be implemented in that state and we have to choose the action which returns maximum rewards for us.

The agent will roam around like a maniac at the start and learn about its actions and rewards. The next time when the agent faces the similar state, it will know what to do in order to minimize the loss and maximize the reward.

The basic equation of Q learning algorithm is

Q(s,a) = Q(s,a) + X*(R(s,a) +  Y * (Max(s',a)) - Q(s,a))

This algorithm follows Off Policy algorithm. The Q value is revised using the next state S' and next action A' based on next state. It basically increases its probability or Q value by adding a discounted reward from the next state that is yet to happen. I guess this is the reason why many other fellows also call it sometimes "greedy".

Here Q is the Q Matrix or better say the brain of our agent. R is the reward matrix which stores the reward for every step taken i.e. reward returned from a taken action in the particular state.

s' is the next state after the action is taken.
X is the learning rate. Closer to 1 means doing no mistakes at all which is superficial. Closer to 0 means no learning at all which we don't want at all.
Y is the discount factor. It tells the agent how much far it has to look. It can be understood with an example. The more importance you give to future rewards, the more will be the discount factor. The more you value near rewards, the less is the discount factor.

0 <= X,Y <= 1

I began with a 4 X 4 matrix like this

The green square is the goal with reward 100. The red square is danger and the agent has to avoid or else -10 reward rest all squares can be used for locomotion with a reward of -1.

I assigned each square with a state number. The actions of the agent will be 0, 1, 2, 3 where 0 is for UP, 1 is for DOWN, 2 is for LEFT, 3 is for RIGHT

This is my reward matrix
reward = np.array([[0, -10, 0, -1],
                   [0, -1, -10, -1],
                   [0, -1, -1, 10],
                   [0, -10, -1, -1],

                   [-10, -1, 0, -1],
                   [-1, -1, -10, -1],
                   [1, -10, -1, -10],
                   [100, -1, -1, 0],

                   [-10, -1, 0, -1],
                   [-1, -1, -1, -10],
                   [-1, -1, -1, -1],
                   [-10, -1, -10, 0],

                   [-1, 0, 0, -1],
                   [-1, 0, -1, -1],
                   [-10, 0, -1, -1],
                   [-1, 0, -1, 0]])

Each row is the state and each column is the action taken in that state. The value indicated is the reward for a particular action in a particular state. Here 0 means invalid reward i.e. that action si not valid. We cannot go UP or LEFT from state 0. Thus for state 0 reward matrix will be [0, -10, 0, -1]. The blueprint is [UP, DOWN, LEFT, RIGHT] so if we jump UP from state 0 it is not possible thus reward is 0. If we jump DOWN we land to red square thus reward is -10. If we move RIGHT we get -1 as our reward. Thus the reward matrix has a total of 16 states where each state has 4 actions.

The next state matrix is this

n_s = np.array([[-1,4,-1,1],




We have 16 states with 4 actions and each value represents the next state when an action is taken. Here -1 represents invalid state i.e. not possible.

The Action matrix is like this
action = np.array([[0,3],
                  [1, 2, 3],
                  [1, 2, 3],  "State 2 can have DOWN, LEFT, RIGHT"
                  [1, 2],



                   [0,2,3],  "State 13 can have DOWN, LEFT, RIGHT"

Q matrix is initialized as 16 X 4 matrix (16 states and 4 actions)
Q matrix stores the experience rewards. For example, Q[12][3] will represent the experience of when the agent was in state 12 and his action is 3 (LEFT). I understand this concept with probability. Greater is the value, greater is the probability of higher reward.

Here i_state represents the initial state of the agent which is 11.
i_s = np.array([[1,2,5,6,8,9,11,12,13,14,15]])
This is an array that stores the states from where agent can begin his training.

I began with 10 episodes as it was enough for my agent to learn. The loop begins with choosing a random state from i_s matrix at line 77.  Our goal is state 3 thus we run the agent until he reaches his goal. Line 81 describes the for loop which finds the maximum Q value (probability) among all possible actions in that state. Line 85: We find the next state based on the action taken. Line 89: We find maximum Q value of all possible actions of the next state. Line 93: We use the Q-Algorithm equation to find the value of that particular Q matrix location. Line94: All movements of the agent is recorded inside this matrix. Line 95: The next state becomes the current state. When the goal is met a new episode is started.

Let the starting position of the agent be 11. Thus now we apply the policy in this state. The possible actions in state 11 are TOP, LEFT, DOWN.
 Let Q[11] be [1, 3, 2, 5]
for i in action[i_state]:
if Q[i_state][i]>Qx:
act = i
Qx = Q[i_state][i]
n_state = n_s[i_state][act]
action[11] is [0, 1, 3]
Thus if Q[11][0] > Qx(-999) then act will be 0
Qx will be updated. When this loops runs act will be 3 because the maximum value among Q[11][0] = 1, Q[11][1] = 3, Q[11][3] = 5 is 5 which has index 3 thus it means the action that will reward maximum Q value in state 11 is 3.
 With this action value we can get out next state and that will be 10 i.e. the agent moves LEFT.

for i in action[n_state]:
Max = max(nxt_values)

Now we have the next state as 10. Now we will calculate the maximum Q value is that state for all possible actions and store it in Max.
 Multiplying this value (Max) with discount factor gives the future reward which we add with the immediate reward. The immediate reward is calculated with R matrix. The sum of immediate reward and the discounted reward is multiplied by the learning rate and added with Q value at that state: action location. 

Each episode has a total reward as calculated at Line 92.
The xd matrix is [11, 15, 14, 13, 9, 5, 6, 2, 3]. This is the optimal pathway of the agent to reach the goal.

So Long

EDIT: There was an error in the action matrix. It has been corrected


Popular posts from this blog

SPI Working with Verilog Code

Verilog Code for I2C Protocol

SR Flip Flop Verilog Code