Expected SARSA in Reinforcement Learning
• Last Updated : 28 Apr, 2021

Prerequisites: SARSA
SARSA and Q-Learning are Reinforcement Learning algorithms that use the Temporal Difference (TD) update to improve the agent’s behaviour. Expected SARSA is an alternative technique for improving the agent’s policy. It is very similar to SARSA and Q-Learning, and differs in how it forms the target of the action-value update.
SARSA is an on-policy technique and Q-Learning is an off-policy technique, but Expected SARSA can be used either on-policy or off-policy. This makes Expected SARSA more flexible than both of these algorithms.
Let’s compare the action-value update of all three algorithms and find out what is different in Expected SARSA.

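• SARSA: $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$
• Q-Learning: $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
• Expected SARSA: $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) \, Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

We see that Expected SARSA takes the weighted sum of the values of all possible next actions, weighting each value by the probability of taking that action under the policy. If the target policy is greedy with respect to the action values, this equation reduces to the Q-Learning update. Otherwise Expected SARSA is on-policy and computes the expected value over all next actions, rather than using a single sampled next action like SARSA.
Keeping the theory and the formulae in mind, let us compare the three algorithms with an experiment. We shall use the Cliff Walking environment provided by the gym library.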
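To make the difference between the three targets concrete, here is a minimal sketch that computes each of them for one hypothetical transition. The numbers (`q_next`, `reward`, `gamma`, `epsilon`) and the helper names are assumptions chosen for illustration; they are not taken from the experiment in this article.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (ties share the greedy mass)."""
    n = len(q_values)
    q_max = max(q_values)
    n_greedy = sum(1 for q in q_values if q == q_max)
    return [
        (1 - epsilon) / n_greedy + epsilon / n if q == q_max else epsilon / n
        for q in q_values
    ]


def sarsa_target(reward, gamma, q_next, next_action):
    # Uses the value of the one next action that was actually sampled
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, gamma, q_next):
    # Uses the value of the best next action
    return reward + gamma * max(q_next)


def expected_sarsa_target(reward, gamma, q_next, epsilon):
    # Uses the probability-weighted average over all next actions
    probs = epsilon_greedy_probs(q_next, epsilon)
    return reward + gamma * sum(p * q for p, q in zip(probs, q_next))


# Hypothetical action values of the next state and transition data
q_next = [-4.0, -2.0, -8.0, -6.0]
reward, gamma, epsilon = -1.0, 1.0, 0.1

probs = epsilon_greedy_probs(q_next, epsilon)
sampled_action = random.choices(range(len(q_next)), weights=probs)[0]

print("SARSA target:", sarsa_target(reward, gamma, q_next, sampled_action))
print("Q-Learning target:", q_learning_target(reward, gamma, q_next))
print("Expected SARSA target:", expected_sarsa_target(reward, gamma, q_next, epsilon))
print("Expected SARSA target (greedy):", expected_sarsa_target(reward, gamma, q_next, 0.0))
```

With these numbers the Q-Learning target is -3.0 and the Expected SARSA target is approximately -3.3. Setting `epsilon` to 0 makes the policy greedy, and the Expected SARSA target then equals the Q-Learning target. The SARSA target varies from run to run because it depends on the sampled action.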
Code: Python code to create the class Agent which will be inherited by the other agents to avoid duplicate code.

## Python3

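```python
# Agent.py

import numpy as np


class Agent:
    """
    The base class that is inherited by the other agents
    to avoid duplicating the 'choose_action' method.
    Subclasses must define self.epsilon, self.Q and self.action_space.
    """

    def choose_action(self, state):
        # Epsilon-greedy selection
        action = 0
        if np.random.uniform(0, 1) < self.epsilon:
            # Explore: sample a random action
            action = self.action_space.sample()
        else:
            # Exploit: take the action with the highest Q-value
            action = np.argmax(self.Q[state, :])
        return action
```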
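One detail worth noting: `np.argmax` returns the first index of the maximum, so when several actions share the highest value (for example, while `Q` is still all zeros) the greedy branch always picks the lowest-numbered action. The sketch below is a stand-alone, pure-Python variant of epsilon-greedy selection that breaks such ties at random. The function name `choose_action_random_ties` and its parameters are hypothetical and are not part of the agents in this article.

```python
import random


def choose_action_random_ties(q_row, epsilon, rng=random):
    """Epsilon-greedy over a list of action values, breaking greedy ties at random."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly
        return rng.randrange(len(q_row))
    # Exploit: pick uniformly among all actions that share the maximum value
    q_max = max(q_row)
    greedy_actions = [a for a, q in enumerate(q_row) if q == q_max]
    return rng.choice(greedy_actions)


# Count how often each action is chosen when all values are tied
rng = random.Random(0)
counts = [0, 0, 0, 0]
for _ in range(1000):
    counts[choose_action_random_ties([0.0, 0.0, 0.0, 0.0], 0.1, rng)] += 1
print(counts)
```

Splitting the greedy choice evenly among tied actions is also how the Expected SARSA update in this article assigns probabilities to tied greedy actions, so this variant keeps the behaviour and the update consistent in that case.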

Code: Python code to create the SARSA Agent.

## Python3

```python
# SarsaAgent.py

import numpy as np
from Agent import Agent


class SarsaAgent(Agent):
    """
    The Agent that uses the SARSA update to improve its behaviour
    """

    def __init__(self, epsilon, alpha, gamma, num_state, num_actions, action_space):
        """
        Constructor
        Args:
            epsilon: The degree of exploration
            alpha: The learning rate (step size)
            gamma: The discount factor
            num_state: The number of states
            num_actions: The number of actions
            action_space: To call the random action
        """
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.num_state = num_state
        self.num_actions = num_actions

        self.Q = np.zeros((self.num_state, self.num_actions))
        self.action_space = action_space

    def update(self, prev_state, next_state, reward, prev_action, next_action):
        """
        Update the action value function using the SARSA update.
        Q(S, A) = Q(S, A) + alpha * (reward + gamma * Q(S_, A_) - Q(S, A))
        Args:
            prev_state: The previous state
            next_state: The next state
            reward: The reward for taking the respective action
            prev_action: The previous action
            next_action: The next action
        Returns:
            None
        """
        predict = self.Q[prev_state, prev_action]
        # The target uses the value of the next action that was actually chosen
        target = reward + self.gamma * self.Q[next_state, next_action]
        self.Q[prev_state, prev_action] += self.alpha * (target - predict)
```

Code: Python code to create the Q-Learning Agent.

## Python3

```python
# QLearningAgent.py

import numpy as np
from Agent import Agent


class QLearningAgent(Agent):
    def __init__(self, epsilon, alpha, gamma, num_state, num_actions, action_space):
        """
        Constructor
        Args:
            epsilon: The degree of exploration
            alpha: The learning rate (step size)
            gamma: The discount factor
            num_state: The number of states
            num_actions: The number of actions
            action_space: To call the random action
        """
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.num_state = num_state
        self.num_actions = num_actions

        self.Q = np.zeros((self.num_state, self.num_actions))
        self.action_space = action_space

    def update(self, state, state2, reward, action, action2):
        """
        Update the action value function using the Q-Learning update.
        Q(S, A) = Q(S, A) + alpha * (reward + gamma * max_a Q(S_, a) - Q(S, A))
        Args:
            state: The previous state
            state2: The next state
            reward: The reward for taking the respective action
            action: The previous action
            action2: The next action (unused here; kept so that all
                agents share the same 'update' signature)
        Returns:
            None
        """
        predict = self.Q[state, action]
        # The target uses the best next action, whatever action is taken next
        target = reward + self.gamma * np.max(self.Q[state2, :])
        self.Q[state, action] += self.alpha * (target - predict)
```

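Code: Python code to create the Expected SARSA Agent. In this experiment we are using the following ε-greedy equation for the policy, where $\mathcal{A}^*(s)$ is the set of actions with the highest value in state $s$ and $|\mathcal{A}|$ is the number of actions:

$$\pi(a \mid s) = \begin{cases} \dfrac{1 - \epsilon}{|\mathcal{A}^*(s)|} + \dfrac{\epsilon}{|\mathcal{A}|} & \text{if } a \in \mathcal{A}^*(s) \\ \dfrac{\epsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$

## Python3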

```python
# ExpectedSarsaAgent.py

import numpy as np
from Agent import Agent


class ExpectedSarsaAgent(Agent):
    def __init__(self, epsilon, alpha, gamma, num_state, num_actions, action_space):
        """
        Constructor
        Args:
            epsilon: The degree of exploration
            alpha: The learning rate (step size)
            gamma: The discount factor
            num_state: The number of states
            num_actions: The number of actions
            action_space: To call the random action
        """
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.num_state = num_state
        self.num_actions = num_actions

        self.Q = np.zeros((self.num_state, self.num_actions))
        self.action_space = action_space

    def update(self, prev_state, next_state, reward, prev_action, next_action):
        """
        Update the action value function using the Expected SARSA update.
        Q(S, A) = Q(S, A) + alpha * (reward + gamma * sum_a pi(a | S_) * Q(S_, a) - Q(S, A))
        Args:
            prev_state: The previous state
            next_state: The next state
            reward: The reward for taking the respective action
            prev_action: The previous action
            next_action: The next action (unused here; kept so that all
                agents share the same 'update' signature)
        Returns:
            None
        """
        predict = self.Q[prev_state, prev_action]

        # Count the greedy actions (ties share the greedy probability)
        expected_q = 0
        q_max = np.max(self.Q[next_state, :])
        greedy_actions = 0
        for i in range(self.num_actions):
            if self.Q[next_state][i] == q_max:
                greedy_actions += 1

        # Epsilon-greedy probabilities of the policy
        non_greedy_action_probability = self.epsilon / self.num_actions
        greedy_action_probability = ((1 - self.epsilon) / greedy_actions) + non_greedy_action_probability

        # Expected value of the next state under the policy
        for i in range(self.num_actions):
            if self.Q[next_state][i] == q_max:
                expected_q += self.Q[next_state][i] * greedy_action_probability
            else:
                expected_q += self.Q[next_state][i] * non_greedy_action_probability

        target = reward + self.gamma * expected_q
        self.Q[prev_state, prev_action] += self.alpha * (target - predict)
```
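The agent above evaluates the same ε-greedy policy that it uses to act, which is the on-policy case. As a hedged illustration of the off-policy use mentioned earlier, the sketch below separates the two: the target policy has its own exploration rate. The names `expected_value`, `expected_sarsa_update` and `target_epsilon`, and the tiny two-state table, are assumptions made for this example and do not appear in the article's agents.

```python
def expected_value(q_row, epsilon):
    """Expected action value under an epsilon-greedy policy over q_row."""
    n = len(q_row)
    q_max = max(q_row)
    n_greedy = sum(1 for q in q_row if q == q_max)
    total = 0.0
    for q in q_row:
        prob = epsilon / n
        if q == q_max:
            prob += (1 - epsilon) / n_greedy
        total += prob * q
    return total


def expected_sarsa_update(Q, state, action, reward, next_state,
                          alpha, gamma, target_epsilon):
    """One Expected SARSA update; target_epsilon defines the target policy."""
    target = reward + gamma * expected_value(Q[next_state], target_epsilon)
    Q[state][action] += alpha * (target - Q[state][action])


# Hypothetical table with two states and two actions
Q = [[0.0, 0.0], [1.0, 3.0]]

# Greedy target policy (target_epsilon = 0): same target as Q-Learning
expected_sarsa_update(Q, 0, 0, reward=-1.0, next_state=1,
                      alpha=0.5, gamma=1.0, target_epsilon=0.0)
print(Q[0][0])
```

Here the behaviour policy could still explore with any ε, while the update evaluates a greedy target policy. Passing the behaviour policy's own ε as `target_epsilon` gives back the on-policy variant.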

Code: Python code to create an environment and test all three algorithms.

## Python3

```python
# main.py

import gym
import numpy as np

from ExpectedSarsaAgent import ExpectedSarsaAgent
from QLearningAgent import QLearningAgent
from SarsaAgent import SarsaAgent

# Note: this script uses the classic gym API (gym < 0.26), where
# env.reset() returns the state and env.step() returns four values.

# Using the gym library to create the environment
env = gym.make('CliffWalking-v0')

# Defining all the required parameters
epsilon = 0.1
total_episodes = 500
max_steps = 100
alpha = 0.5
gamma = 1

# The two variables below are used to record
# the reward obtained by each algorithm
episodeReward = 0
totalReward = {
    'SarsaAgent': [],
    'QLearningAgent': [],
    'ExpectedSarsaAgent': []
}

# Defining all the three agents
expectedSarsaAgent = ExpectedSarsaAgent(
    epsilon, alpha, gamma, env.observation_space.n,
    env.action_space.n, env.action_space)
qLearningAgent = QLearningAgent(
    epsilon, alpha, gamma, env.observation_space.n,
    env.action_space.n, env.action_space)
sarsaAgent = SarsaAgent(
    epsilon, alpha, gamma, env.observation_space.n,
    env.action_space.n, env.action_space)

# Now we run all the episodes and calculate the reward obtained by
# each agent at the end of the episode
agents = [expectedSarsaAgent, qLearningAgent, sarsaAgent]

for agent in agents:
    for _ in range(total_episodes):
        # Initialize the necessary parameters before
        # the start of the episode
        t = 0
        state1 = env.reset()
        action1 = agent.choose_action(state1)
        episodeReward = 0
        while t < max_steps:
            # Getting the next state, reward, and other parameters
            state2, reward, done, info = env.step(action1)

            # Choosing the next action
            action2 = agent.choose_action(state2)

            # Learning the Q-value
            agent.update(state1, state2, reward, action1, action2)

            state1 = state2
            action1 = action2

            # Updating the respective values
            t += 1
            episodeReward += reward

            # Stop when the episode has terminated
            if done:
                break
        # Append the sum of reward at the end of the episode
        totalReward[type(agent).__name__].append(episodeReward)
env.close()

# Calculate the mean of sum of returns for each episode
meanReturn = {
    'SARSA-Agent': np.mean(totalReward['SarsaAgent']),
    'Q-Learning-Agent': np.mean(totalReward['QLearningAgent']),
    'Expected-SARSA-Agent': np.mean(totalReward['ExpectedSarsaAgent'])
}

# Print the results
print(f"SARSA Average Sum of Return: {meanReturn['SARSA-Agent']}")
print(f"Q-Learning Average Sum of Return: {meanReturn['Q-Learning-Agent']}")
print(f"Expected Sarsa Average Sum of Return: {meanReturn['Expected-SARSA-Agent']}")
```
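The script above reports a single mean over all episodes for each agent. Because ε-greedy exploration makes per-episode returns noisy, it can also help to look at how the returns evolve during training. The following is a minimal sketch of a moving average that could be applied to one of the lists in `totalReward`; the helper `moving_average` and the sample returns are hypothetical and are not part of the original experiment.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    # Keep a running sum so each step costs O(1)
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Hypothetical per-episode returns, for illustration only
episode_returns = [-100, -60, -40, -30, -20, -18, -17]
print(moving_average(episode_returns, 3))
```

The smoothed list has `len(values) - window + 1` entries and can be plotted against the episode index to compare the learning curves of the three agents.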

Output: [figure omitted]

Conclusion:
We have seen that Expected SARSA performs reasonably well in certain problems. Its update takes the values of all possible next actions into account, weighted by their probabilities, instead of relying on a single sampled or maximal action. The fact that Expected SARSA can be used either off-policy or on-policy is what makes this algorithm so flexible.
