Hola,


I worked on the SARSA algorithm as well as the Q-learning algorithm, and the two produced different Q matrices (duh!). The methodology of the two algorithms shows how each one handles future rewards: Q-learning updates toward the best possible next action regardless of the action actually taken (off-policy), while SARSA works on the current policy and commits to the next action before updating the Q matrix (on-policy).

The grid-game example from the previous post showed different results when I implemented SARSA. It also involved some repetitive paths, whereas Q-learning didn't show any. Stepping through a single update showed that SARSA follows the path the agent actually takes, while Q-learning follows the optimal (greedy) agent path.

To implement both, here is the pseudocode I keep in mind.

QL

Initialize the Q matrix

Loop (episodes):

    Choose an initial state (s)

    while (s is not the goal):

        Choose an action (a) with the maximum Q value (with some random exploration)

        Determine the next state (s')

        Target -> immediate reward + discounted future value (Gamma * Max(Q[s'][a']))

        Update the Q matrix: Q[s][a] <- Q[s][a] + Alpha * (Target - Q[s][a])

        s <- s'

    New episode
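The Q-learning loop above can be sketched in Python. The 4x4 grid, the reward values, and the hyper-parameters below are illustrative assumptions for the sketch, not necessarily the exact ones from my experiments:

```python
import numpy as np

n_states, n_actions = 16, 4      # states 0..15 row-major; actions: up, down, left, right
goal = 15                        # hypothetical goal cell
alpha, gamma, epsilon = 0.1, 0.8, 0.2

def step(s, a):
    """Move on the 4x4 grid; a move off the board leaves the state unchanged."""
    row, col = divmod(s, 4)
    if a == 0:   row = max(row - 1, 0)   # up
    elif a == 1: row = min(row + 1, 3)   # down
    elif a == 2: col = max(col - 1, 0)   # left
    else:        col = min(col + 1, 3)   # right
    s2 = row * 4 + col
    return s2, (100 if s2 == goal else -1)

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))      # initialize the Q matrix

for episode in range(500):
    s = 0                                # initial state
    while s != goal:
        # epsilon-greedy: mostly exploit, sometimes explore
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # off-policy target: best next action, r + gamma * max Q[s']
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
```

Note that the update uses max over Q[s'] regardless of which action the agent takes next; that is what makes it off-policy.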

SARSA-L

Initialize the Q matrix

Loop (episodes):

    Choose an initial state (s) and pick an action (a) from the current policy

    while (s is not the goal):

        Take action (a) and get the next state (s')

        Get a' from s' using the same policy

        TD error -> immediate reward + Gamma * Q[s'][a'] - Q[s][a]

        Update the Q matrix: Q[s][a] <- Q[s][a] + Alpha * (TD error)

        s <- s'; a <- a'
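The SARSA loop above can be sketched the same way. The toy 4x4 grid, rewards, and hyper-parameters are again illustrative assumptions:

```python
import numpy as np

n_states, n_actions = 16, 4
goal = 15                        # hypothetical goal cell
alpha, gamma, epsilon = 0.1, 0.8, 0.2
rng = np.random.default_rng(1)

def step(s, a):
    """Move on the 4x4 grid; a move off the board leaves the state unchanged."""
    row, col = divmod(s, 4)
    if a == 0:   row = max(row - 1, 0)   # up
    elif a == 1: row = min(row + 1, 3)   # down
    elif a == 2: col = max(col - 1, 0)   # left
    else:        col = min(col + 1, 3)   # right
    s2 = row * 4 + col
    return s2, (100 if s2 == goal else -1)

def epsilon_greedy(Q, s):
    """The behaviour policy; SARSA also uses it for the update target."""
    return int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))

Q = np.zeros((n_states, n_actions))      # initialize the Q matrix

for episode in range(500):
    s = 0
    a = epsilon_greedy(Q, s)             # pick the first action on-policy
    while s != goal:
        s2, r = step(s, a)
        a2 = epsilon_greedy(Q, s2)       # the action that will actually be taken next
        # on-policy target: Q[s'][a'], not a max over actions
        Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
        s, a = s2, a2
```

The only difference from the Q-learning sketch is the target: SARSA bootstraps from the action it actually takes next, so exploration noise feeds back into the Q values.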

Here are the outputs from Q-L and SARSA-L

The above is Q-L

This one is SARSA

There is a difference between the two Q matrices. I worked through another example using both Q-learning and SARSA. It might appear similar to the mouse-on-a-cliff (cliff-walking) problem to some readers, so bear with me.

The code for Naruto-Q-Learning is below

Here is Hinata trying to find her way to her goal by using SARSA

The code for Hinata SARSA Learning

I used the epsilon-greedy method for action selection. I generate a random floating-point number between 0 and 1 and set epsilon to 0.2. If the generated number is greater than 0.2, I select the action with the maximum Q value (argmax); if it is less than 0.2, I select one of the permitted actions at random. With each passing episode, I decrease the value of epsilon (epsilon decay). This ensures that as the agent learns its way, it follows the learned path rather than continuing to explore. Exploration is at its maximum at the start of the simulation and gradually decreases as the episodes pass.

This is the decay of epsilon over the episodes.
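The epsilon-greedy selection with decay described above can be sketched like this; the multiplicative decay rate and the floor value are my own illustrative choices:

```python
import random

epsilon, decay, eps_min = 0.2, 0.99, 0.05   # start value from the post; decay/floor assumed

def choose_action(q_row, allowed_actions):
    """Explore with probability epsilon, otherwise exploit the best Q value."""
    if random.random() < epsilon:
        return random.choice(allowed_actions)            # explore: random permitted action
    return max(allowed_actions, key=lambda a: q_row[a])  # exploit: argmax over permitted actions

for episode in range(300):
    # ... run one episode, calling choose_action(Q[s], permitted) at each step ...
    epsilon = max(eps_min, epsilon * decay)              # decay epsilon after each episode
```

Clamping with a floor keeps a small amount of exploration alive even late in training, which is a common practical choice.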

The path followed in the above simulation is 0 - 4 - 8 - 9 - 10 - 11 - 7. Sometimes the agent also follows the same path as it did during Q-learning. Well, I am continuing my exploration of this and will post more details as I learn more about RL.

Till then, bye


Nice update, thank you. Please can you explain how you generate your matrices for reward and state? Thanks.

The first row specifies

TOP BOTTOM LEFT RIGHT

so 0 points if the move goes out of the box,

100 points if the agent lands on the green box,

-1 if the agent lands on a box other than green or red,

-10 if the agent lands on the red box by any move specified above.

Each box has its own state number, starting from the top left and going horizontally from 0 to 15.

So -1 is an impossible state.
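Based on that description, one way to generate such a reward matrix is sketched below; the positions of the green and red boxes here are hypothetical placeholders, not the ones from the post:

```python
import numpy as np

# Rows are states 0..15 (top-left, row-major on a 4x4 grid);
# columns are the moves TOP, BOTTOM, LEFT, RIGHT.
GREEN, RED = 15, 10                          # hypothetical goal and penalty cells
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # TOP, BOTTOM, LEFT, RIGHT

R = np.zeros((16, 4))
for s in range(16):
    row, col = divmod(s, 4)
    for a, (dr, dc) in enumerate(moves):
        r2, c2 = row + dr, col + dc
        if not (0 <= r2 < 4 and 0 <= c2 < 4):
            R[s, a] = 0                      # 0 points if the move goes out of the box
        else:
            s2 = r2 * 4 + c2
            if s2 == GREEN:
                R[s, a] = 100                # landing on the green box
            elif s2 == RED:
                R[s, a] = -10                # landing on the red box
            else:
                R[s, a] = -1                 # landing on any other box
```

Each row of R then gives the immediate reward of the four moves from that state, which is what the Q update consumes.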

Thanks, I got it from your previous post. Please can you share your email address with me?

It's working.

ReplyDeleteCan it be ported into hardware?
