### SARSA Learning with Python

Hola,
I worked on the SARSA algorithm as well as the Q-learning algorithm, and the two produced different Q matrices (duh!). The methodology of the two algorithms shows how one responds to future rewards by updating with the best possible next action (off-policy, for Q-learning), while the other works off the current policy and takes the next action before updating the Q matrix (on-policy, for SARSA).

When I implemented SARSA on the grid-game example from the previous post, it produced different results. SARSA's route also included some repetitive paths, whereas Q-learning's showed none. Stepping through the runs showed that SARSA follows the path the agent actually takes, while Q-learning follows the optimal agent path.

To implement both, I keep the pseudocode in mind.

QL

    initialize Q matrix
    loop (episodes):
        choose an initial state (s)
        while goal not reached:
            choose an action (a) with the maximum Q value
            determine the next state (s')
            compute the target -> immediate reward + discounted max Q[s'][a']
            update Q[s][a] toward the target
            s <- s'
    start a new episode
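The loop above can be sketched in Python. This is a minimal, hypothetical 4x4 grid (states 0-15, goal at 15, step reward -1, goal reward 100), not the post's actual game; the constants and the `step` transition function are my own assumptions for illustration:

```python
import numpy as np

# Hypothetical 4x4 grid: states 0-15, goal in the bottom-right corner.
N_STATES, N_ACTIONS, GOAL = 16, 4, 15
GAMMA, ALPHA, EPSILON, EPISODES = 0.8, 0.1, 0.2, 500
rng = np.random.default_rng(0)

def step(s, a):
    """Deterministic moves (0=up, 1=down, 2=left, 3=right), clipped at the edges."""
    row, col = divmod(s, 4)
    row = max(row - 1, 0) if a == 0 else min(row + 1, 3) if a == 1 else row
    col = max(col - 1, 0) if a == 2 else min(col + 1, 3) if a == 3 else col
    s2 = row * 4 + col
    return s2, (100.0 if s2 == GOAL else -1.0)  # immediate reward

Q = np.zeros((N_STATES, N_ACTIONS))
for _ in range(EPISODES):
    s = 0
    while s != GOAL:
        # Epsilon-greedy behaviour policy (see the action-selection section below)
        a = int(np.argmax(Q[s])) if rng.random() > EPSILON else int(rng.integers(N_ACTIONS))
        s2, r = step(s, a)
        # Off-policy target: max over actions in s', regardless of what we do next
        Q[s, a] += ALPHA * (r + GAMMA * Q[s2].max() - Q[s, a])
        s = s2
```

The key off-policy step is `Q[s2].max()`: the update always assumes the greedy follow-up, even when exploration takes a different action.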

SARSA-L

    initialize Q matrix
    loop (episodes):
        choose an initial state (s) and an action (a)
        while goal not reached:
            take action (a), observe the reward and the next state (s')
            choose a' from s' using the current policy
            TD error -> immediate reward + gamma * Q[s'][a'] - Q[s][a]
            update Q[s][a] using the TD error
            s <- s'; a <- a'
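The same toy grid (again a hypothetical sketch, not the post's code) shows the on-policy difference: a' is committed to first, and that actual action feeds the target instead of the max:

```python
import numpy as np

# Same hypothetical 4x4 grid as the Q-learning sketch.
N_STATES, N_ACTIONS, GOAL = 16, 4, 15
GAMMA, ALPHA, EPSILON, EPISODES = 0.8, 0.1, 0.2, 500
rng = np.random.default_rng(1)

def step(s, a):
    """Deterministic moves (0=up, 1=down, 2=left, 3=right), clipped at the edges."""
    row, col = divmod(s, 4)
    row = max(row - 1, 0) if a == 0 else min(row + 1, 3) if a == 1 else row
    col = max(col - 1, 0) if a == 2 else min(col + 1, 3) if a == 3 else col
    s2 = row * 4 + col
    return s2, (100.0 if s2 == GOAL else -1.0)

def choose(Q, s):
    """Epsilon-greedy policy: the same rule selects behaviour AND the update target."""
    return int(np.argmax(Q[s])) if rng.random() > EPSILON else int(rng.integers(N_ACTIONS))

Q = np.zeros((N_STATES, N_ACTIONS))
for _ in range(EPISODES):
    s = 0
    a = choose(Q, s)
    while s != GOAL:
        s2, r = step(s, a)
        a2 = choose(Q, s2)   # commit to a' first...
        Q[s, a] += ALPHA * (r + GAMMA * Q[s2, a2] - Q[s, a])  # ...then update with it
        s, a = s2, a2
```

Because `Q[s2, a2]` sometimes reflects an exploratory action, SARSA's values account for the exploration the agent actually does, which is why its paths can look less optimal than Q-learning's.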

Here are the outputs from Q-L and SARSA-L

The above is Q-L

This one is SARSA

There is a difference between the two Q matrices. I worked on another example using both Q-learning and SARSA. It might look similar to the cliff-walking (mouse-and-cliff) problem to some readers, so bear with me.

The code for Naruto-Q-Learning is below

Here is Hinata trying to find her way to her goal by using SARSA

The code for Hinata SARSA Learning

I used the epsilon-greedy method for action selection. I generate a random floating-point number between 0 and 1 and set epsilon to 0.2. If the generated number is greater than 0.2, I select the action with the maximum Q value (argmax); if it is less than 0.2, I select one of the permitted actions at random. With each passing episode I decrease the value of epsilon (epsilon decay). This ensures that as the agent learns its way, it follows the learned path rather than continuing to explore: exploration is maximum at the start of the simulation and gradually decreases as episodes pass.
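That selection rule can be sketched as follows; the starting epsilon, the decay factor, and the permitted-action mask are illustrative values, not taken from the post's code:

```python
import numpy as np

rng = np.random.default_rng(42)
epsilon = 0.2   # explore 20% of the time at the start
DECAY = 0.99    # multiplied in after every episode (assumed decay schedule)

def select_action(q_row, allowed, eps):
    """Argmax over permitted actions with probability 1-eps, else a random permitted one."""
    if rng.random() > eps:
        masked = np.where(allowed, q_row, -np.inf)  # never argmax a forbidden move
        return int(np.argmax(masked))
    return int(rng.choice(np.flatnonzero(allowed)))

q_row = np.array([1.0, 5.0, 2.0, 0.0])
allowed = np.array([True, True, False, True])  # e.g. "left" would leave the grid

for episode in range(100):
    a = select_action(q_row, allowed, epsilon)
    epsilon *= DECAY   # epsilon decay: exploit more as episodes pass
```

After 100 episodes epsilon has shrunk to roughly a third of its starting value, so late-stage episodes are almost entirely greedy.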

This is the decay of the epsilon.

The path followed in the above simulation is 0 - 4 - 8 - 9 - 10 - 11 - 7. Sometimes the agent also follows the same path it found during Q-learning. I am continuing my exploration of this and will post more details as I learn more about RL.

Till then, bye


2. Nice update, thank you. Please can you explain how you generate your matrices for reward and state? Thanks.

1. The first row specifies the columns TOP, BOTTOM, LEFT, RIGHT:
0 points if the move goes out of the box,
100 points if the agent lands on the green box,
-1 if the agent lands on a box other than green or red,
-10 if the agent lands on the red box by any of the moves above.

Each box has its own state number, starting from the top left and going horizontally from 0 to 15.

So -1 marks an impossible state.
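A sketch of how such a reward matrix could be generated; the green and red box positions here are placeholders, not the post's actual layout:

```python
import numpy as np

# 16 states (0-15, top-left to bottom-right), columns TOP/BOTTOM/LEFT/RIGHT.
GREEN, RED = {15}, {7}                 # assumed cell numbers for illustration
moves = {0: -4, 1: +4, 2: -1, 3: +1}   # TOP, BOTTOM, LEFT, RIGHT state offsets

R = np.zeros((16, 4))
for s in range(16):
    row, col = divmod(s, 4)
    for a, delta in moves.items():
        # Out-of-box moves score 0, matching the convention above
        if (a == 0 and row == 0) or (a == 1 and row == 3) \
           or (a == 2 and col == 0) or (a == 3 and col == 3):
            R[s, a] = 0
            continue
        s2 = s + delta
        if s2 in GREEN:
            R[s, a] = 100    # landing on the green box
        elif s2 in RED:
            R[s, a] = -10    # landing on the red box
        else:
            R[s, a] = -1     # any other box
```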

4. It's working.
Can it be ported to hardware?

