
Model-Free Reinforcement Learning: An Overview

In reinforcement learning, agents use a particular ‘algorithm’ to interact with the environment.

Now this algorithm can be model-based or model-free.



In this article, we first give a basic overview of RL and then discuss the difference between model-based and model-free algorithms. Next, we study the Bellman equation, which is the basis of model-free learning algorithms, and look at how model-free algorithms are divided into on-policy and off-policy, and into value-based and policy-based. We then survey the major model-free algorithms. Finally, we implement a simple Q-learning agent using Gymnasium (formerly OpenAI Gym).

Reinforcement Learning

Generally, when we talk about machine learning problems, we think of categorizing them into supervised and unsupervised learning. However, there is another less glamorized category which is reinforcement learning (RL).



In reinforcement learning, we train an agent by making it interact with the environment. Unlike supervised or unsupervised learning, where we start with a labeled or unlabeled dataset, in reinforcement learning there is no data to start with. The agent gathers its data by interacting with the environment, and uses this data either to build an optimal policy that guides its actions or to build a model that simulates the environment. The objective of the agent is to learn a policy or model that determines its behavior in the environment.

Here, by environment, we mean the dynamic context in which the agent operates. For example, think of a self-driving car. The road, the signals, the pedestrians, other vehicles, and so on make up the environment. The self-driving car (‘agent’) interacts with the ‘environment’ through an ‘action’ (moving, accelerating, turning right/left, stopping, etc.). For each interaction, the environment produces a new state and a ‘reward’ (a notionally high value for correct moves and a notionally low value for incorrect moves). Through these rewards, the agent learns a ‘policy’ or ‘model’. The ‘policy’ or ‘model’ describes what action the agent should take in a given ‘state’.
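The interaction loop described above can be sketched in a few lines of plain Python. This is only an illustration: the ToyCorridor environment, its rewards, and the run_random_episode helper are hypothetical and are not part of any library.

```python
import random


class ToyCorridor:
    """Hypothetical 1-D corridor: states 0..4, with the goal at state 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clip the move.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small step cost, bonus at the goal
        return self.state, reward, done


def run_random_episode(env, rng, max_steps=100):
    """Roll out one episode with a uniformly random policy."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = rng.choice([0, 1])             # the agent picks an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


total, steps = run_random_episode(ToyCorridor(), random.Random(0))
print(f"return={total:.1f} after {steps} steps")
```

A learning agent would replace the random choice with a policy that improves as rewards are observed.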

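All the components described above are the components of a Markov Decision Process (MDP). Formally, an MDP consists of a set of states, a set of actions, the transition probabilities between states for each action, a reward function, and a discount factor.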

Model-based vs Model Free Reinforcement Learning (RL)

The environment with which the agent interacts is a black box for the agent, i.e. the agent does not know how it operates. Based on what the agent tries to learn about this black box, we can divide RL into two categories.

Model-Based

Model-free

In the remainder of this article, we focus only on model-free approaches to RL. Model-free algorithms are generally further characterized by whether they are value-based or policy-based, and on-policy or off-policy.

Value-based and Policy Based RL

Off Policy and On Policy Model Free RL

This division is generally made for value-based methods, according to how the Q-values are updated.

Bellman Equation in RL

The Bellman equation is the foundation of many algorithms in RL. It has many forms depending on the type of algorithm and the value being optimized.

The return for any state or state-action pair is decomposed into two parts: the immediate reward and the discounted value of the successor state.

This recursive relationship is known as the Bellman Equation.

$$V(S_1) = \mathbb{E}\left[R + \gamma V(S_2)\right]$$

Let us understand the Bellman equation with the help of the Q-value.

In value-based model-free RL, the Bellman equation is written in terms of the Q-value. When we build a value function in an RL algorithm, we update it with a value called Q, i.e. the quality value of each state-action pair.

Consider a simple environment with only 5 possible states and 4 possible actions. We can build a look-up table with rows representing the states and columns representing the actions. Each entry in the table represents the expected cumulative reward obtained by taking that particular action when the agent is in that state. This value is known as the Q-value, and it is what the agent learns through its interaction with the environment. Once the agent has interacted with the environment for a sufficient amount of time, the values in the table approach the optimal values. The agent then decides its action based on these values.
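A minimal sketch of such a look-up table, assuming a list-of-lists layout and made-up values for one state (the greedy_action helper is illustrative, not from any library):

```python
N_STATES, N_ACTIONS = 5, 4

# One row per state, one column per action; all estimates start at zero.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

# Hypothetical learned values for state 2, purely for illustration.
q_table[2] = [0.1, -0.4, 0.7, 0.3]


def greedy_action(q_table, state):
    """Return the index of the highest-valued action in the given state."""
    row = q_table[state]
    return max(range(len(row)), key=row.__getitem__)


print(greedy_action(q_table, 2))  # column 2 holds the largest value, 0.7
```

Ties are broken in favour of the lowest action index here, which is a design choice of this sketch rather than a requirement.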

Formally, the Q-value of a state-action pair is denoted as Q(s, a), where s is the current state and a is the action taken in that state.

Bellman equation expresses the relationship between the value of a state (or state-action pair) and the expected future rewards achievable from that state.

Whenever an agent interacts with an environment, it gets two things in return: the immediate reward for that action and the successor state. Based on these two, it updates the Q-value of the current state-action pair.

$$Q(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} Q(s', a')\right]$$

The expected return from starting in state s, taking action a, and following the optimal policy afterward equals the expected reward Rt+1 obtained by selecting action a in state s, plus the maximum expected discounted return achievable from any potential next state-action pair (s′, a′).
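As a small worked example of this relationship, the following sketch computes the right-hand side for a single sampled transition. The discount factor and all numbers are assumptions chosen for illustration.

```python
GAMMA = 0.9  # assumed discount factor


def bellman_target(reward, next_q_values, gamma=GAMMA):
    """One-sample estimate of the reward plus the discounted best next value."""
    return reward + gamma * max(next_q_values)


# Hypothetical numbers: a reward of -1 and four estimated next-state Q-values.
target = bellman_target(-1.0, [2.0, 5.0, 3.0, 0.0])
print(target)  # -1 + 0.9 * 5 = 3.5
```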

Model-Free Algorithms

Let us discuss some popular model-free algorithms.

Q-learning

Q-learning is a classic RL algorithm that aims to learn the quality of actions in a given state. The Q-value represents the expected cumulative reward of taking a particular action in a specific state. It is an off-policy, value-based model-free algorithm.

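In Q-learning, the behavioral policy (the policy used to pick actions) and the target policy (the policy being learned) are different. The update based on the Bellman equation is given by:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[R_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$

where α is the learning rate and γ is the discount factor.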
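The standard tabular Q-learning update can be sketched in plain Python as follows. The function name, the list-of-lists table, and the default alpha and gamma values are assumptions made for this illustration.

```python
def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.5, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Off-policy target: bootstrap from the best next action, regardless
    # of which action the behavioral policy actually takes next.
    best_next = 0.0 if done else max(q[next_state])
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


q = [[0.0, 0.0], [1.0, 3.0]]  # hypothetical table: 2 states, 2 actions
q_learning_update(q, state=0, action=1, reward=-1.0, next_state=1, done=False)
print(q[0][1])  # about 0.925
```

The target uses the maximum over the next state's row (3.0 here), which is what makes the update independent of the action chosen next.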

SARSA

SARSA stands for State-Action-Reward-State-Action. The update equation for SARSA depends on the current state, current action, reward obtained, next state, and next action. SARSA chooses an action by following the current epsilon-greedy policy and updates its Q-values accordingly.

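In SARSA, the behavioral policy and the target policy are the same. The update based on the Bellman equation for a state-action pair is given by:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[R_{t+1} + \gamma Q(s', a') - Q(s, a)\right]$$

where a′ is the action actually chosen in s′ by the current policy, α is the learning rate, and γ is the discount factor.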
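The standard tabular SARSA update can be sketched in plain Python as follows. The function name, the list-of-lists table, and the default alpha and gamma values are assumptions made for this illustration.

```python
def sarsa_update(q, state, action, reward, next_state, next_action, done,
                 alpha=0.5, gamma=0.95):
    """Apply one tabular SARSA update in place and return the TD error."""
    # On-policy target: bootstrap from the action the policy actually
    # chose in the next state, not from the best available action.
    next_value = 0.0 if done else q[next_state][next_action]
    td_error = reward + gamma * next_value - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


q = [[0.0, 0.0], [1.0, 3.0]]  # hypothetical table: 2 states, 2 actions
sarsa_update(q, state=0, action=1, reward=-1.0,
             next_state=1, next_action=0, done=False)
print(q[0][1])  # about -0.025
```

Because the next action here is the lower-valued one (1.0 rather than 3.0), the estimate moves down instead of up, which shows how exploration affects what SARSA learns.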

DQN

DQN (Deep Q-Network) is based on the Q-learning algorithm and integrates deep learning with it. In Q-learning we build a table of state-action pairs. However, this is not feasible when the number of state-action pairs becomes very large. So instead of storing Q-values in a table, a deep neural network is used to approximate the Q-function: it takes the state as input and outputs a quality value for each action.
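The article does not spell out how such a network is trained, so the following is only a rough sketch of the usual idea: the network is fitted by regression toward the same Q-learning target. The batch format, the q_fn interface, and toy_q_fn are hypothetical stand-ins for a replay buffer and a neural network, and practical DQN implementations add components (such as a separate target network) that are omitted here.

```python
def dqn_targets(batch, q_fn, gamma=0.95):
    """Compute targets r + gamma * max_a' Q(s', a') for a batch.

    `batch` is a list of (state, action, reward, next_state, done) tuples
    and `q_fn(state)` returns a list of Q-values, one per action.
    """
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else max(q_fn(next_state))
        targets.append(reward + gamma * bootstrap)
    return targets


def mean_squared_td_error(batch, q_fn, gamma=0.95):
    """Loss a DQN-style learner would minimise with gradient descent."""
    targets = dqn_targets(batch, q_fn, gamma)
    errors = [(y - q_fn(s)[a]) ** 2
              for y, (s, a, *_rest) in zip(targets, batch)]
    return sum(errors) / len(errors)


def toy_q_fn(state):
    # Stand-in for a neural network: any function from state to Q-values.
    return [0.1 * state, 0.2 * state]


batch = [(1, 0, -1.0, 2, False), (2, 1, 20.0, 3, True)]
print(dqn_targets(batch, toy_q_fn))
print(mean_squared_td_error(batch, toy_q_fn))
```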

Actor Critic

Actor critic combines elements of both policy-based (actor) and value-based (critic) methods. The main idea is to have two components working together: an actor, which learns a policy to select actions, and a critic, which evaluates the chosen actions.
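To make the division of labour concrete, here is a minimal sketch of one common variant, a one-step tabular actor-critic with a softmax policy. The function names, learning rates, and table layout are assumptions for illustration; practical implementations usually use neural networks for both components.

```python
import math


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(prefs, values, state, action, reward, next_state, done,
                      actor_lr=0.1, critic_lr=0.1, gamma=0.95):
    """One-step tabular actor-critic update, applied in place."""
    # Critic: the TD error says how much better or worse the outcome was
    # than the critic's current estimate of the state's value.
    next_value = 0.0 if done else values[next_state]
    td_error = reward + gamma * next_value - values[state]
    values[state] += critic_lr * td_error

    # Actor: the gradient of log pi(action) with respect to the softmax
    # preferences is (1 - pi(a)) for the chosen action and -pi(b) otherwise.
    probs = softmax(prefs[state])
    for a in range(len(probs)):
        grad = (1.0 if a == action else 0.0) - probs[a]
        prefs[state][a] += actor_lr * td_error * grad
    return td_error


prefs = [[0.0, 0.0], [0.0, 0.0]]  # actor: action preferences per state
values = [0.0, 0.0]               # critic: state-value estimates
actor_critic_step(prefs, values, state=0, action=1, reward=1.0,
                  next_state=1, done=False)
print(softmax(prefs[0]))  # action 1 is now slightly more likely than action 0
```

A positive TD error raises the probability of the action just taken, and a negative one lowers it.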

Implementation of Model Free RL

We will use Gymnasium (the maintained successor to OpenAI Gym) to build a model-free RL agent using the Q-learning algorithm (off-policy). To train an RL agent, we need a simulated environment for it to interact with. This is what the Gymnasium toolkit provides: a variety of environments that can be used for testing different reinforcement learning algorithms. Users can also create and register their own custom environments, allowing them to test algorithms on specific tasks relevant to their research or application.

Just as PyTorch and TensorFlow have become standard frameworks for implementing deep learning tasks, Gym (and now Gymnasium) has become the de facto standard for benchmarking and evaluating RL algorithms.

To install Gymnasium, we can use the command below:

!pip install gymnasium

1. Understand the environment

We will be using the Taxi-v3 environment from Gymnasium.

import gymnasium as gym
env = gym.make('Taxi-v3',render_mode='ansi')
env.reset()
 
print(env.render())

                    

Output:

Taxi-V3 environment

2. Create the Q-learning agent

import numpy as np
from collections import defaultdict
import matplotlib.pyplot as plt
 
 
class QLearningAgent:
    def __init__(self, env, learning_rate, initial_epsilon, epsilon_decay,
                 final_epsilon, discount_factor=0.95):
        self.env = env
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
 
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon
 
        # Initialize an empty dictionary of state-action values
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))
 
    def get_action(self, obs) -> int:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() < self.epsilon:
            return self.env.action_space.sample()
        return int(np.argmax(self.q_values[obs]))
 
    def update(self, obs, action, reward, terminated, next_obs):
        # A terminal state has no future value, but its reward must still be learned
        future_q_value = 0.0 if terminated else np.max(self.q_values[next_obs])
        td_error = (reward + self.discount_factor * future_q_value
                    - self.q_values[obs][action])
        self.q_values[obs][action] += self.learning_rate * td_error
 
    def decay_epsilon(self):
        """Decrease the exploration rate epsilon until it reaches its final value"""
        self.epsilon = max(self.final_epsilon,
                           self.epsilon - self.epsilon_decay)

                    

3. Define our training method

def train_agent(agent, env, episodes, eval_interval=100):
    rewards = []
    best_reward = -np.inf
    for i in range(episodes):
        obs, _ = env.reset()
        terminated = False
        truncated = False
        length = 0
        total_reward = 0
 
        while not (terminated or truncated):
 
            action = agent.get_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
 
            agent.update(obs, action, reward, terminated, next_obs)
            obs = next_obs
            length = length+1
            total_reward += reward
 
        agent.decay_epsilon()
        rewards.append(total_reward)
 
        if i >= eval_interval:
            avg_return = np.mean(rewards[i-eval_interval: i])
            best_reward = max(avg_return, best_reward)
        if i % eval_interval == 0 and i > 0:
 
            print(f"Episode{i} -> best_reward={best_reward} ")
    return rewards

                    

4. Running our training method

episodes = 20000
learning_rate = 0.5
discount_factor = 0.95
initial_epsilon = 1
final_epsilon = 0
# Positive step size: epsilon falls linearly to final_epsilon over the first half of training
epsilon_decay = (initial_epsilon - final_epsilon) / (episodes / 2)
env = gym.make('Taxi-v3', render_mode='ansi')
agent = QLearningAgent(env, learning_rate, initial_epsilon,
                       epsilon_decay, final_epsilon)
 
returns = train_agent(agent, env, episodes)

                    

Output:

Episode100 -> best_reward=-224.3 
Episode200 -> best_reward=-116.22 
Episode300 -> best_reward=-40.75 
Episode400 -> best_reward=-14.89 
Episode500 -> best_reward=-3.9 
Episode600 -> best_reward=1.65 
Episode700 -> best_reward=2.13 
Episode800 -> best_reward=2.13 
Episode900 -> best_reward=3.3 
Episode1000 -> best_reward=4.32 
Episode1100 -> best_reward=6.03 
Episode1200 -> best_reward=6.28 
Episode1300 -> best_reward=7.15 
Episode1400 -> best_reward=7.62 
...

5. Plotting our returns

def plot_returns(returns):
    plt.plot(np.arange(len(returns)), returns)
    plt.title('Episode returns')
    plt.xlabel('Episode')
    plt.ylabel('Return')
    plt.show()
 
plot_returns(returns)

                    

Output:

Plot of rewards

6. Running our Agent

The run_agent function executes our trained agent in the Taxi-v3 environment and displays its interaction:

def run_agent(agent, env):
    agent.epsilon = 0    # no need to keep exploring
    obs, _ = env.reset() # get the initial state
    print(env.render())  # in 'ansi' mode render() returns a string
    terminated = truncated = False
 
    while not (terminated or truncated):
        action = agent.get_action(obs)
        obs, _, terminated, truncated, _ = env.step(action)
        print(env.render())
 
env = gym.make('Taxi-v3', render_mode='ansi')
run_agent(agent, env)

                    

Output:

Output of the agent action

Conclusion

In this article, we covered the basics of reinforcement learning, the difference between model-free and model-based RL, how the Q-value is determined in model-free RL, and the difference between on-policy and off-policy algorithms, and we saw an implementation of the Q-learning algorithm.

