Expected SARSA in Reinforcement Learning

Prerequisites: SARSA

SARSA and Q-Learning are Reinforcement Learning algorithms that use the Temporal Difference (TD) update to improve the agent’s behaviour. Expected SARSA is an alternative technique for improving the agent’s policy. It is very similar to SARSA and Q-Learning, and differs only in the action-value update it follows.
We know that SARSA is an on-policy technique and Q-Learning is an off-policy technique, but Expected SARSA can be used either on-policy or off-policy. This makes Expected SARSA much more flexible than both of these algorithms.

Let’s compare the action-value updates of all three algorithms and find out what is different in Expected SARSA.

  • SARSA:
         Q(s_{t}, a_{t}) = Q(s_{t}, a_{t}) + \alpha (r_{t+1}+\gamma Q(s_{t+1}, a_{t+1})-Q(s_{t}, a_{t}))
  • Q-Learning:
         Q(s_{t}, a_{t}) = Q(s_{t}, a_{t}) + \alpha (r_{t+1}+\gamma \max_{a}Q(s_{t+1}, a)-Q(s_{t}, a_{t}))
  • Expected SARSA:
         Q(s_{t}, a_{t}) = Q(s_{t}, a_{t}) + \alpha (r_{t+1}+\gamma \sum_{a} \pi (a | s_{t+1}) Q(s_{t+1}, a)-Q(s_{t}, a_{t}))

We see that Expected SARSA takes a weighted sum over all possible next actions, with the weights given by the probability of taking each action under the policy. If the target policy is greedy with respect to the action values, this update reduces exactly to the Q-Learning update. Otherwise, Expected SARSA is on-policy and computes the expected return over all actions, rather than using the single sampled next action as SARSA does. A small sketch of this expectation is shown below.
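
As a minimal sketch (not part of the original experiment, and using made-up numbers), the Expected SARSA target for a single transition under an ε-greedy policy could be computed like this:

# expected_sarsa_target_sketch.py
# Illustration only: all values below are assumptions, not taken from the experiment.

import numpy as np

epsilon, gamma, reward = 0.1, 1.0, -1.0
q_next = np.array([0.0, -2.0, -1.5, -4.0])    # Q(s_{t+1}, a) for 4 hypothetical actions

# Epsilon-greedy probabilities: every action gets epsilon / |A|,
# and the greedy action(s) share the remaining 1 - epsilon.
probs = np.full(len(q_next), epsilon / len(q_next))
greedy = q_next == q_next.max()
probs[greedy] += (1 - epsilon) / greedy.sum()

expected_q = np.dot(probs, q_next)            # sum_a pi(a | s_{t+1}) * Q(s_{t+1}, a)
target = reward + gamma * expected_q          # the Expected SARSA TD target
print(target)                                 # -1.1875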

Keeping the theory and the formulae in mind, let us compare all three algorithms with an experiment. We shall use the Cliff Walking environment provided by the gym library.



Code: Python code to create the Agent base class, which is inherited by the other agents to avoid duplicating code.


# Agent.py
  
import numpy as np
  
class Agent:
    """
    The Base class that is implemented by
    other classes to avoid duplicating the 'choose_action'
    method
    """
    def choose_action(self, state): 
        action = 0
        if np.random.uniform(0, 1) < self.epsilon: 
            action = self.action_space.sample()
        else:
            action = np.argmax(self.Q[state, :]) 
        return action



Code: Python code to create the SARSA Agent.


# SarsaAgent.py
  
import numpy as np
from Agent import Agent
  
class SarsaAgent(Agent):
    """
    The Agent that uses the SARSA update to improve its behaviour
    """
    def __init__(self, epsilon, alpha, gamma, num_state, num_actions, action_space):
        """
        Constructor
        Args:
            epsilon: The degree of exploration
            alpha: The learning rate
            gamma: The discount factor
            num_state: The number of states
            num_actions: The number of actions
            action_space: Used to sample a random action
        """
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.num_state = num_state
        self.num_actions = num_actions
  
        self.Q = np.zeros((self.num_state, self.num_actions))
        self.action_space = action_space
  
    def update(self, prev_state, next_state, reward, prev_action, next_action): 
        """
        Update the action value function using the SARSA update.
        Q(S, A) = Q(S, A) + alpha * (reward + gamma * Q(S', A') - Q(S, A))
        Args:
            prev_state: The previous state
            next_state: The next state
            reward: The reward for taking the respective action
            prev_action: The previous action
            next_action: The next action
        Returns:
            None
        """
        predict = self.Q[prev_state, prev_action]
        target = reward + self.gamma * self.Q[next_state, next_action]
        self.Q[prev_state, prev_action] += self.alpha * (target - predict)



Code: Python code to create the Q-Learning Agent.


# QLearningAgent.py
  
import numpy as np
from Agent import Agent
  
class QLearningAgent(Agent):
    def __init__(self, epsilon, alpha, gamma, num_state, num_actions, action_space):
        """
        Constructor
        Args:
            epsilon: The degree of exploration
            alpha: The learning rate
            gamma: The discount factor
            num_state: The number of states
            num_actions: The number of actions
            action_space: Used to sample a random action
        """
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.num_state = num_state
        self.num_actions = num_actions
  
        self.Q = np.zeros((self.num_state, self.num_actions))
        self.action_space = action_space
    def update(self, state, state2, reward, action, action2):
        """
        Update the action value function using the Q-Learning update.
        Q(S, A) = Q(S, A) + alpha * (reward + gamma * max_a Q(S', a) - Q(S, A))
        Args:
            state: The previous state
            state2: The next state
            reward: The reward for taking the respective action
            action: The previous action
            action2: The next action (not used by the Q-Learning update,
                kept only for a common interface with the other agents)
        Returns:
            None
        """
        predict = self.Q[state, action]
        target = reward + self.gamma * np.max(self.Q[state2, :])
        self.Q[state, action] += self.alpha * (target - predict)



Code: Python code to create the Expected SARSA Agent. In this experiment, we use the following ε-greedy policy:

 \pi (a | s_{t+1}) = \begin{cases}     \dfrac{1 - \epsilon}{\text{Number of Greedy Actions}} + \dfrac{\epsilon}{|A|} &\text{if } a \text{ is a greedy action}\\     \dfrac{\epsilon}{|A|} &\text{if } a \text{ is a non-greedy action}     \end{cases}

where |A| is the total number of actions and the greedy actions are those with the maximal Q-value in state s_{t+1}.


# ExpectedSarsaAgent.py
  
import numpy as np
from Agent import Agent
  
class ExpectedSarsaAgent(Agent):
    def __init__(self, epsilon, alpha, gamma, num_state, num_actions, action_space):
        """
        Constructor
        Args:
            epsilon: The degree of exploration
            alpha: The learning rate
            gamma: The discount factor
            num_state: The number of states
            num_actions: The number of actions
            action_space: Used to sample a random action
        """
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.num_state = num_state
        self.num_actions = num_actions
  
        self.Q = np.zeros((self.num_state, self.num_actions))
        self.action_space = action_space
    def update(self, prev_state, next_state, reward, prev_action, next_action):
        """
        Update the action value function using the Expected SARSA update.
        Q(S, A) = Q(S, A) + alpha * (reward + gamma * sum_a pi(a|S') * Q(S', a) - Q(S, A))
        Args:
            prev_state: The previous state
            next_state: The next state
            reward: The reward for taking the respective action
            prev_action: The previous action
            next_action: The next action (not used by the Expected SARSA update,
                kept only for a common interface with the other agents)
        Returns:
            None
        """
        predict = self.Q[prev_state, prev_action]
  
        expected_q = 0
        q_max = np.max(self.Q[next_state, :])
        greedy_actions = 0
        for i in range(self.num_actions):
            if self.Q[next_state][i] == q_max:
                greedy_actions += 1
      
        non_greedy_action_probability = self.epsilon / self.num_actions
        greedy_action_probability = ((1 - self.epsilon) / greedy_actions) + non_greedy_action_probability
  
        for i in range(self.num_actions):
            if self.Q[next_state][i] == q_max:
                expected_q += self.Q[next_state][i] * greedy_action_probability
            else:
                expected_q += self.Q[next_state][i] * non_greedy_action_probability
  
        target = reward + self.gamma * expected_q
        self.Q[prev_state, prev_action] += self.alpha * (target - predict)
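
As a quick sanity check (a sketch with assumed values, not part of the original article), the ε-greedy probabilities used in the expectation above always sum to one:

# Sanity-check sketch: assumed values, mirroring the probabilities
# computed inside ExpectedSarsaAgent.update.
epsilon, num_actions = 0.1, 4
greedy_actions = 2  # e.g. two actions tie for the maximum Q-value
non_greedy_p = epsilon / num_actions
greedy_p = (1 - epsilon) / greedy_actions + non_greedy_p
total = greedy_actions * greedy_p + (num_actions - greedy_actions) * non_greedy_p
print(total)  # 1.0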



Code: Python code to create the environment and test all three algorithms.


# main.py
  
import gym
import numpy as np
  
from ExpectedSarsaAgent import ExpectedSarsaAgent
from QLearningAgent import QLearningAgent
from SarsaAgent import SarsaAgent
from matplotlib import pyplot as plt
  
# Using the gym library to create the environment
env = gym.make('CliffWalking-v0')
  
# Defining all the required parameters
epsilon = 0.1
total_episodes = 500
max_steps = 100
alpha = 0.5
gamma = 1
"""
    The two parameters below is used to calculate
    the reward by each algorithm
"""
episodeReward = 0
totalReward = {
    'SarsaAgent': [],
    'QLearningAgent': [],
    'ExpectedSarsaAgent': []
}
  
# Defining all the three agents
expectedSarsaAgent = ExpectedSarsaAgent(
    epsilon, alpha, gamma, env.observation_space.n, 
    env.action_space.n, env.action_space)
qLearningAgent = QLearningAgent(
    epsilon, alpha, gamma, env.observation_space.n, 
    env.action_space.n, env.action_space)
sarsaAgent = SarsaAgent(
    epsilon, alpha, gamma, env.observation_space.n, 
    env.action_space.n, env.action_space)
  
# Now we run all the episodes and calculate the reward obtained by
# each agent at the end of the episode
  
agents = [expectedSarsaAgent, qLearningAgent, sarsaAgent]
  
for agent in agents:
    for _ in range(total_episodes):
        # Initialize the necessary parameters before
        # the start of the episode
        t = 0
        state1 = env.reset() 
        action1 = agent.choose_action(state1) 
        episodeReward = 0
        while t < max_steps:
  
            # Getting the next state, reward, and other parameters
            state2, reward, done, info = env.step(action1) 
      
            # Choosing the next action 
            action2 = agent.choose_action(state2) 
              
            # Learning the Q-value 
            agent.update(state1, state2, reward, action1, action2) 
      
            state1 = state2 
            action1 = action2 
              
            # Updating the respective values
            t += 1
            episodeReward += reward
              
            # If at the end of learning process 
            if done: 
                break
        # Append the sum of reward at the end of the episode
        totalReward[type(agent).__name__].append(episodeReward)
env.close()
  
# Calculate the mean of sum of returns for each episode
meanReturn = {
    'SARSA-Agent': np.mean(totalReward['SarsaAgent']),
    'Q-Learning-Agent': np.mean(totalReward['QLearningAgent']),
    'Expected-SARSA-Agent': np.mean(totalReward['ExpectedSarsaAgent'])
}
  
# Print the results
print(f"SARSA Average Sum of Reward: {meanReturn['SARSA-Agent']}")
print(f"Q-Learning Average Sum of Return: {meanReturn['Q-Learning-Agent']}")
print(f"Expected Sarsa Average Sum of Return: {meanReturn['Expected-SARSA-Agent']}")



Output:
The script prints the average sum of rewards per episode for the SARSA, Q-Learning, and Expected SARSA agents; the exact values vary from run to run.

Conclusion:
We have seen that Expected SARSA performs reasonably well on certain problems. Instead of relying on a single sampled next action, it averages over all possible next actions when computing its update target. The fact that Expected SARSA can be used either on-policy or off-policy is what makes this algorithm so flexible.
