Expected SARSA in Reinforcement Learning
SARSA and Q-Learning technique in Reinforcement Learning are algorithms that uses Temporal Difference(TD) Update to improve the agent’s behaviour. Expected SARSA technique is an alternative for improving the agent’s policy. It is very similar to SARSA and Q-Learning, and differs in the action value function it follows.
We know that SARSA is an on-policy technique, Q-learning is an off-policy technique, but Expected SARSA can be use either as an on-policy or off-policy. This is where Expected SARSA is much more flexible compared to both these algorithms.
Let’s compare the action-value function of all the three algorithms and find out what is different in Expected SARSA.
- Expected SARSA:
We see that Expected SARSA takes the weighted sum of all possible next actions with respect to the probability of taking that action. If the Expected Return is greedy with respect to the expected return, then this equation gets transformed to Q-Learning. Otherwise Expected SARSA is on-policy and computes the expected return for all actions, rather than randomly selecting an action like SARSA.
Keeping the theory and the formulae in mind, let us compare all the three algorithms, with an experiment. We shall implement a Cliff Walker as our environment provided by the gym library
Code: Python code to create the class Agent which will be inherited by the other agents to avoid duplicate code.
Attention reader! Don’t stop learning now. Get hold of all the important Machine Learning Concepts with the Machine Learning Foundation Course at a student-friendly price and become industry ready.
Code: Python code to create the SARSA Agent.
Code: Python code to create the Q-Learning Agent.
Code: Python code to create the Expected SARSA Agent. In this experiment we are using the following equation for the policy.
Python code to create an environment and Test all the three algorithms.
We have seen that Expected SARSA performs reasonably well in certain problems. It considers all possible outcomes before selecting a particular action. The fact that Expected SARSA can be used either as an off or on policy, is what makes this algorithm so dynamic.