SARSA Learning with Python

I worked on the SARSA algorithm as well as the Q-learning algorithm, and the two produced different Q matrices (duh!). The difference comes from how each one updates. Q-learning is off-policy: its update assumes the best possible action in the next state (the maximum Q value), regardless of what the agent actually does next. SARSA is on-policy: it picks the next action with its current policy first and then updates the Q matrix with the value of that action.

When I implemented SARSA on the grid game from the previous post, the results were different. SARSA's paths included some repeated states, whereas Q-learning's did not. Looking at a single step, SARSA updates along the path the agent actually follows, while Q-learning updates toward the optimal path.

To implement both, I keep the pseudocode for each in mind: Q-learning first, then SARSA.


Q-learning:

Initialize the Q matrix
Loop (episodes):
   Choose an initial state (s)
   While s is not the goal:
      Choose an action (a) with the maximum Q value
      Determine the next state (s')
      Find the total reward -> immediate reward + discounted reward (Gamma * max over a' of Q[s'][a'])
      Update the Q matrix
      s <- s'
   Start a new episode
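The Q-learning update in the pseudocode above can be sketched in Python as follows. This is a minimal sketch, not the code from this post: the function name `q_learning_update`, the learning rate `alpha`, and the default values are assumptions for illustration, and the Q matrix is a plain list of lists indexed as `Q[state][action]`.

```python
def q_learning_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Apply one off-policy (Q-learning) update to Q[s][a] in place.

    alpha (learning rate) and gamma (discount) are assumed values.
    """
    # Off-policy target: use the best Q value available in the next state,
    # whatever action the agent ends up taking there.
    td_target = reward + gamma * max(Q[s_next])
    # Move the current estimate a fraction alpha toward the target.
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q


# Tiny usage example with 3 states and 2 actions.
Q = [[0.0, 0.0], [2.0, 5.0], [0.0, 0.0]]
q_learning_update(Q, s=0, a=1, reward=-1, s_next=1, alpha=0.5, gamma=0.9)
```

The `max(Q[s_next])` term is what makes this update off-policy: the target ignores which action is actually chosen in `s_next`.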


SARSA:

Initialize the Q matrix
Loop (episodes):
   Choose an initial state (s)
   Choose an action (a) from s
   While s is not the goal:
      Take the action (a) and get the next state (s')
      Get the next action (a') from s'
      Total reward (TD error) -> immediate reward + Gamma * Q[s'][a'] - Q[s][a]
      Update Q
      s <- s', a <- a'
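The SARSA update in the pseudocode above can be sketched the same way. Again this is only a sketch under assumptions: the function name `sarsa_update`, the learning rate `alpha`, and the default values are illustrative and not taken from the post's code.

```python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one on-policy (SARSA) update to Q[s][a] in place.

    alpha (learning rate) and gamma (discount) are assumed values.
    """
    # On-policy target: use the Q value of the action a_next that the
    # agent has actually selected for the next state.
    td_target = reward + gamma * Q[s_next][a_next]
    # TD error = immediate reward + gamma * next Q value - current Q value.
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return Q


# Same toy Q matrix as a Q-learning example would use, but the agent
# picked action 0 in the next state, so the target uses Q[1][0] = 2.0.
Q = [[0.0, 0.0], [2.0, 5.0], [0.0, 0.0]]
sarsa_update(Q, s=0, a=1, reward=-1, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
```

Because the target depends on `a_next`, an exploratory (non-greedy) next action changes what SARSA learns, which is one reason the two Q matrices end up different.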

Here are the outputs from Q-learning and SARSA:

[Image: Q-learning output]

[Image: SARSA output]

The two Q matrices differ. I also worked through another example using both Q-learning and SARSA. It may look similar to the mouse cliff problem to some readers, so bear with me.

The code for Naruto-Q-Learning is below

Here is Hinata trying to find her way to her goal by using SARSA

The code for Hinata SARSA Learning

I used the epsilon-greedy method for action selection. I generate a random floating-point number between 0 and 1 and set epsilon to 0.2. If the generated number is greater than 0.2, I select the action with the maximum Q value (argmax). If it is less than 0.2, I select one of the permitted actions at random. After each episode I decrease the value of epsilon (epsilon decay). This ensures that as the agent learns, it increasingly follows the learned path rather than continuing to explore. Exploration is highest at the start of the simulation and gradually decreases as episodes pass.
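The epsilon-greedy selection and decay described above might look like the sketch below. The starting epsilon of 0.2 comes from the post; the function names, the multiplicative decay rate of 0.99, and the floor of 0.01 are assumptions, since the exact decay schedule is not given here.

```python
import random


def choose_action(q_row, allowed_actions, epsilon, rng=random):
    """Pick an action for one state using epsilon-greedy selection.

    q_row holds the Q values of the current state, one per action;
    allowed_actions lists the permitted action indices.
    """
    if rng.random() < epsilon:
        # Explore: pick any permitted action at random.
        return rng.choice(allowed_actions)
    # Exploit: pick the permitted action with the highest Q value.
    return max(allowed_actions, key=lambda a: q_row[a])


def decay_epsilon(epsilon, decay_rate=0.99, min_epsilon=0.01):
    """Shrink epsilon after an episode (assumed multiplicative schedule)."""
    return max(min_epsilon, epsilon * decay_rate)


epsilon = 0.2
for episode in range(5):
    action = choose_action([1.0, 5.0, 3.0], [0, 1, 2], epsilon)
    epsilon = decay_epsilon(epsilon)
```

Restricting both branches to `allowed_actions` keeps the agent from choosing moves that would leave the grid.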

This is the decay of the epsilon.

The path followed in the above simulation is 0 - 4 - 8 - 9 - 10 - 11 - 7. Sometimes the agent also follows the same path it took with Q-learning. I am continuing to explore this and will post more details as I learn more about RL.
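A path like the one above can be read off a learned Q matrix by following the best permitted action from each state. The sketch below assumes a 4x4 grid with states numbered 0 to 15 row by row from the top left, and a state matrix where -1 marks a move that leaves the grid. The action order (up, down, left, right) and the function names are hypothetical, not taken from the post's code.

```python
N = 4  # grid side; states 0..15 are numbered row by row from the top left
# Assumed action order: up, down, left, right.
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def build_state_matrix(n=N):
    """Return S where S[s][a] is the next state, or -1 if the move is off-grid."""
    S = []
    for s in range(n * n):
        row, col = divmod(s, n)
        next_states = []
        for d_row, d_col in MOVES:
            r, c = row + d_row, col + d_col
            inside = 0 <= r < n and 0 <= c < n
            next_states.append(r * n + c if inside else -1)
        S.append(next_states)
    return S


def greedy_path(Q, S, start, goal, max_steps=50):
    """Follow the highest-valued permitted action from start until goal."""
    path = [start]
    s = start
    # max_steps guards against loops in a poorly trained Q matrix.
    while s != goal and len(path) <= max_steps:
        allowed = [a for a, nxt in enumerate(S[s]) if nxt != -1]
        best = max(allowed, key=lambda a: Q[s][a])
        s = S[s][best]
        path.append(s)
    return path
```

Calling `greedy_path(Q, build_state_matrix(), start=0, goal=7)` with a trained `Q` returns the list of visited states, which makes it easy to compare the SARSA and Q-learning paths.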

Till then, bye


  1. Your blog is nice. I believe this will surely help the readers who are really in need of this vital piece of information. Thanks for sharing and kindly keep updating.


  2. nice update, thank you, please can you explain how you generate your matrices for reward and state, thanks

    1. The first row specifies
      so 0 points if the move goes out of the box
      100 points if the agent lands on the green box
      -1 if the agent lands on a box other than green or red
      -10 if the agent lands on the red box by any move specified above

      Each box has its own state number,
      starting from the top left and going horizontally from 0 to 15.

      So -1 is an impossible state.

  3. thanks, I got it from your previous post. please can you share your email address with me?

  4. It's working.
    Can it be ported into hardware?
