Hola,

I worked on the SARSA algorithm as well as the Q-learning algorithm, and the two ended up with different Q matrices (duh!). The difference in methodology comes down to how each one accounts for future rewards. Q-learning updates toward the best action available in the next state, regardless of what the agent actually does next (off-policy), while SARSA works from the current policy and picks its next action before updating the Q matrix (on-policy).
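For reference, the two update rules can be written side by side. This is the standard textbook form, not something taken from my implementation; the learning rate \alpha and the discount factor \gamma are symbols I am assuming here, with r the immediate reward and a' the action in the next state s'.

```latex
% Q-learning (off-policy): bootstrap from the best action in s'
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

% SARSA (on-policy): bootstrap from the action a' the policy actually picks in s'
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \, Q(s',a') - Q(s,a) \right]
```

The only difference is the bootstrapped term: a max over next actions for Q-learning, and the value of the action actually chosen for SARSA.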

The grid game example from the previous post gave different results when I implemented SARSA. The SARSA path also contained some repeated steps, whereas the Q-learning path didn't show any. Looking at a single step, SARSA's update followed the path the agent actually took, while Q-learning's update followed the optimal path.

To implement both, I find it easiest to remember them as pseudocode.

QL

Initialize the Q matrix.

Loop (episodes):

    Choose an initial state (s)

    While the goal has not been reached:

        Choose the action (a) with the maximum Q value in state s

        Determine the next state (s')

        Find the total reward -> immediate reward + discounted reward (discount * max over a' of Q[s'][a'])

        Update Q[s][a] in the Q matrix

        s <- s'

    Start a new episode
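The QL steps above can be sketched in Python roughly as follows. This is a minimal sketch under a few assumptions of mine, not the code from the previous post: the environment is a hypothetical 1-D corridor (`corridor_step`), the update uses a learning rate `alpha`, and action selection is epsilon-greedy with random tie-breaking instead of a pure maximum, so that the agent still explores while all Q values are zero.

```python
import random

def q_learning(n_states, n_actions, step, start_states, goal,
               episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; `step(s, a)` must return (next_state, reward)."""
    rng = random.Random(seed)
    # Initialize the Q matrix with zeros.
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        # Choose an initial state.
        s = rng.choice(start_states)
        while s != goal:
            # Epsilon-greedy choice (assumption): mostly the max-Q action,
            # occasionally a random one; ties are broken at random.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(Q[s])
                a = rng.choice([x for x in range(n_actions) if Q[s][x] == best])
            # Determine the next state and the immediate reward.
            s_next, r = step(s, a)
            # Total reward: immediate reward + discounted max over next actions.
            target = r + gamma * max(Q[s_next])
            # Move Q[s][a] toward the target by a fraction alpha.
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

# Hypothetical environment: states 0..4 in a row, goal at state 4.
# Action 0 moves left, action 1 moves right; reaching the goal pays 1.
def corridor_step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
    reward = 1.0 if s_next == 4 else 0.0
    return s_next, reward

Q = q_learning(5, 2, corridor_step, start_states=[0, 1, 2, 3], goal=4)
# Greedy action per non-goal state after training.
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(4)]
print(greedy)
```

The `max(Q[s_next])` term is what makes this off-policy: the update looks at the best next action even when the epsilon-greedy step ends up taking a different one.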

SARSA-L

initia…
