
Q-Learning in Python

Reinforcement Learning is a learning paradigm in which a learning agent learns, over time, to behave optimally in a certain environment by interacting with it continuously. During its course of learning, the agent experiences various situations in the environment; these are called states. While in a state, the agent may choose from a set of allowable actions, which may fetch different rewards (or penalties). Over time, the learning agent learns to maximize these rewards so as to behave optimally in whatever state it finds itself. Q-learning is a basic form of Reinforcement Learning that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.


Q-learning in Reinforcement Learning

Q-learning is a popular model-free reinforcement learning algorithm used in machine learning and artificial intelligence applications. It falls under the category of temporal difference learning techniques, in which an agent learns by interacting with the environment, observing the results of its actions, and receiving feedback in the form of rewards.

Key Components of Q-learning

  1. Q-Values or Action-Values: Q-values are defined for state-action pairs. Q(S, A) is an estimate of how good it is to take action A in state S. This estimate of Q(S, A) is iteratively refined using the TD-update rule, which we will see in the upcoming sections.
  2. Rewards and Episodes: An agent starts from a start state and makes several transitions from its current state to a next state, based on its choice of action and on the environment it is interacting with. At every transition, the agent takes an action, observes a reward from the environment, and then moves to another state. If at any point the agent ends up in one of the terminating states, no further transitions are possible; this marks the completion of an episode.
  3. Temporal Difference or TD-Update: The Temporal Difference or TD-Update rule can be represented as follows:
    Q(S, A) ← Q(S, A) + α (R + γ Q(S', A') − Q(S, A))

    This update rule for estimating the Q-value is applied at every time step of the agent's interaction with the environment. The terms used are explained below:
    • S - Current State of the agent.
    • A - Current Action Picked according to some policy.
    • S' - Next State where the agent ends up.
    • A' - Next best action to be picked using current Q-value estimation, i.e. pick the action with the maximum Q-value in the next state.
    • R - Current reward observed from the environment in response to the current action.
    • γ (0 < γ ≤ 1): Discount factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since the Q-value is an estimate of the expected reward from a state, the discounting rule applies here as well.
    • α: Learning rate (step size) used to update the estimate of Q(S, A).
  4. Selecting the Course of Action with the ϵ-greedy policy: A simple method for selecting an action based on the current Q-value estimates is the ϵ-greedy policy (a short sketch follows this list). This is how it operates:

Superior Q-Value Action (Exploitation):

  • With probability 1−ϵ (the majority of cases), select the action with the highest current Q-value.
  • In this exploitation case, the agent chooses the course of action that, given its current understanding, it believes is optimal.

Exploration through Random Action:

  • With probability ϵ, rather than selecting the action with the highest Q-value, select an action at random, irrespective of Q-values.
  • This exploration lets the agent learn about the possible benefits of actions it has not yet tried.
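
A minimal sketch of these two ideas together, the ϵ-greedy action choice and the TD update, is shown below. The table size, the parameter values, and the helper names choose_action and td_update are illustrative assumptions for this sketch, not part of any specific environment.

import numpy as np

# Illustrative sizes and parameters (assumed for this sketch)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # Q[state, action]
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def choose_action(state):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # random action (explore)
    return int(np.argmax(Q[state]))           # best known action (exploit)

def td_update(s, a, r, s_next):
    # TD update: Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])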

How Does Q-Learning Work?

Q-learning models engage in an iterative process where various components collaborate to train the model. This iterative procedure encompasses the agent exploring the environment and continuously updating the model based on this exploration. The key components of Q-learning include:

  1. States: Variables that identify an agent's current position in the environment.
  2. Actions: Operations undertaken by the agent in specific states.
  3. Rewards: Positive or negative responses provided to the agent based on its actions.
  4. Episodes: Complete rounds of interaction; an episode ends when the agent reaches a terminal state and takes no further actions.
  5. Q-values: Metrics used to evaluate actions at specific states.

There are two methods for determining Q-values:

Temporal Difference: Updates the current Q-value estimate by comparing it with a target formed from the observed reward and the estimated value of the next state and action.

Bellman's Equation: A recursive formula introduced by Richard Bellman in 1957 for calculating the value of a given state in a Markov Decision Process (MDP) in terms of the immediate reward and the value of successor states. It is particularly influential in the context of Q-learning and optimal decision-making.

The Equation is expressed as :

Q(s, a) = R(s, a) + γ · max_a Q(s', a)

Where,

  • Q(s, a): Estimated value of taking action a in state s.
  • R(s, a): Immediate reward received for taking action a in state s.
  • γ: Discount factor applied to future rewards.
  • max_a Q(s', a): Highest estimated Q-value over the actions available in the next state s'.

Bellman's equation is crucial in reinforcement learning as it helps in evaluating the long-term expected rewards associated with different actions in a given state. It forms the basis for Q-learning algorithms, guiding agents to learn optimal policies through iterative updates based on observed experiences.
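
As a quick worked example with made-up numbers (not drawn from the environment used later): if the immediate reward is 1, the discount factor is 0.9, and the best available Q-value in the next state is 0.5, the Bellman target works out to 1.45.

# Toy numbers, purely for illustration
reward = 1.0                       # R(s, a)
gamma = 0.9                        # discount factor
next_q_values = [0.5, 0.2, 0.0]    # current estimates Q(s', a') for each next action

q_sa = reward + gamma * max(next_q_values)
print(q_sa)   # 1.0 + 0.9 * 0.5 = 1.45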

What is a Q-table?

The Q-table functions as a repository of expected rewards (Q-values) for every action in every state of a given environment. It serves as a guide for the agent, indicating which actions are likely to yield positive outcomes in various scenarios.

Each row in the Q-table corresponds to a distinct situation the agent might face, while the columns represent the available actions. Through interactions with the environment and the receipt of rewards or penalties, the Q-table is dynamically updated to capture the model's evolving understanding.

Reinforcement learning aims to enhance performance by refining the Q-table, enabling the agent to make informed decisions. As the Q-table undergoes continuous updates with more feedback, it becomes a more accurate resource, empowering the agent to make optimal choices and achieve superior results.

Crucially, the Q-table is closely tied to the Q-function, a mathematical expression that considers the current state and action, generating outputs that include anticipated future rewards for that specific state-action pair. By consulting the Q-table, the agent can retrieve expected future rewards, guiding it toward optimized decision-making and states.
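
A minimal sketch of how an agent might consult such a table, using an arbitrary 3-state, 2-action example (the values and the name Q_example are made up and separate from the implementation below): each row holds the expected future rewards for one state, and the index of the largest entry in that row is the greedy action.

import numpy as np

# Hypothetical Q-table: rows are states, columns are actions
Q_example = np.array([[0.1, 0.4],    # state 0: action 1 currently looks best
                      [0.7, 0.2],    # state 1: action 0 currently looks best
                      [0.0, 0.0]])   # state 2: nothing learned yet

state = 1
best_action = int(np.argmax(Q_example[state]))     # -> 0
expected_return = Q_example[state, best_action]    # -> 0.7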

Implementation of Q-Learning

Defining the Environment and Parameters

import numpy as np

# Define the environment
n_states = 16  # Number of states in the grid world
n_actions = 4  # Number of possible actions (up, down, left, right)
goal_state = 15  # Goal state

# Initialize Q-table with zeros
Q_table = np.zeros((n_states, n_actions))

# Define parameters
learning_rate = 0.8
discount_factor = 0.95
exploration_prob = 0.2
epochs = 1000

In this Q-learning implementation, a grid world environment is defined with 16 states, and the agent can take 4 possible actions: up, down, left, and right. The goal is to reach state 15. The Q-table, initialized with zeros, serves as a memory to store Q-values for state-action pairs.

The learning parameters include a learning rate of 0.8, a discount factor of 0.95, an exploration probability of 0.2, and a total of 1000 training epochs. The learning rate influences the weight given to new information, the discount factor adjusts the importance of future rewards, and the exploration probability determines the likelihood of the agent exploring new actions versus exploiting known actions.

Throughout the training epochs, the agent explores the environment, updating Q-values based on received rewards and future expectations, ultimately learning a strategy to navigate the grid world towards the goal state.
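
To make the role of the learning rate concrete, the Q-learning update applied in the next section can be rearranged as a weighted blend of the old estimate and the new TD target, with learning_rate controlling the weight on new information. The numbers here are illustrative only and are not taken from the training run:

# Equivalent blended form of the Q-learning update (illustrative numbers)
learning_rate, discount_factor = 0.8, 0.95
old_q, reward, best_next_q = 0.2, 0.0, 0.6

td_target = reward + discount_factor * best_next_q        # 0.95 * 0.6 = 0.57
new_q = (1 - learning_rate) * old_q + learning_rate * td_target
print(round(new_q, 3))   # 0.496: mostly the new target, a little of the old estimate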

Implementing the Q-learning Algorithm

# Q-learning algorithm
for epoch in range(epochs):
    current_state = np.random.randint(0, n_states)  # Start from a random state

    while current_state != goal_state:
        # Choose action with epsilon-greedy strategy
        if np.random.rand() < exploration_prob:
            action = np.random.randint(0, n_actions)  # Explore
        else:
            action = np.argmax(Q_table[current_state])  # Exploit

        # Simulate the environment (move to the next state)
        # For simplicity, the transition ignores the chosen action and always steps forward
        next_state = (current_state + 1) % n_states

        # Define a simple reward function (1 if the goal state is reached, 0 otherwise)
        reward = 1 if next_state == goal_state else 0

        # Update Q-value using the Q-learning update rule
        Q_table[current_state, action] += learning_rate * \
            (reward + discount_factor *
             np.max(Q_table[next_state]) - Q_table[current_state, action])

        current_state = next_state  # Move to the next state

# After training, the Q-table represents the learned Q-values
print("Learned Q-table:")
print(Q_table)

Output:

Learned Q-table:
[[0.48767498 0.48377358 0.48751874 0.48377357]
 [0.51252074 0.51317781 0.51334071 0.51334208]
 [0.54036009 0.5403255  0.54018713 0.54036009]
 [0.56880009 0.56880009 0.56880008 0.56880009]
 [0.59873694 0.59873694 0.59873694 0.59873694]
 [0.63024941 0.63024941 0.63024941 0.63024941]
 [0.66342043 0.66342043 0.66342043 0.66342043]
 [0.6983373  0.6983373  0.6983373  0.6983373 ]
 [0.73509189 0.73509189 0.73509189 0.73509189]
 [0.77378094 0.77378094 0.77378094 0.77378094]
 [0.81450625 0.81450625 0.81450625 0.81450625]
 [0.857375   0.857375   0.857375   0.857375  ]
 [0.9025     0.9025     0.9025     0.9025    ]
 [0.95       0.95       0.95       0.95      ]
 [1.         1.         1.         1.        ]
 [0.         0.         0.         0.        ]]

The Q-learning algorithm involves iterative training where the agent explores and updates its Q-table. It starts from a random state, selects actions via epsilon-greedy strategy, and simulates movements. A reward function grants a 1 for reaching the goal state. Q-values update using the Q-learning rule, combining received and expected rewards. This process continues until the agent learns optimal strategies. The final Q-table represents acquired state-action values after training.
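
As a follow-up, the learned table can be turned into a greedy policy by taking, for each state, the action with the highest Q-value. This short snippet assumes the Q_table and the numpy import from the code above are still in scope:

# Derive the greedy policy and state values from the learned Q-table
greedy_policy = np.argmax(Q_table, axis=1)   # best action index for each state
state_values = np.max(Q_table, axis=1)       # value of acting greedily in each state

print("Greedy action per state:", greedy_policy)
print("Estimated state values:", state_values)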

Q-learning Advantages and Disadvantages

Advantages:

  • Model-free: the agent learns directly from interaction and does not need a model of the environment's transition dynamics.
  • Off-policy: it can learn the value of the optimal policy even while following an exploratory (e.g. ϵ-greedy) behaviour policy.
  • Simple to implement for small, discrete problems, and with sufficient exploration and a suitable learning rate it converges to the optimal Q-values.

Disadvantages:

  • The tabular form scales poorly: the Q-table grows with the number of states and actions, so large or continuous spaces require function approximation.
  • Learning can be slow and is sensitive to the choice of learning rate, discount factor, and exploration schedule.
  • Performance depends on balancing exploration and exploitation; too little exploration can leave the agent stuck with suboptimal actions.

Q-learning Applications

Applications for Q-learning, a reinforcement learning algorithm, can be found in many different fields. Here are a few noteworthy instances:

  • Playing Games: Q-learning and its deep variants (such as Deep Q-Networks) have been used to train agents that learn to play video games and board games by discovering which moves yield the highest long-term reward.
  • Automation: Robots and industrial controllers can learn tasks such as navigation and manipulation through trial and error, guided by rewards for successful outcomes.
  • Driverless Automobiles: Reinforcement learning methods, including Q-learning, are explored for driving decisions such as lane changing and route selection.
  • Finance: Used to learn trading and portfolio-management policies that aim to maximize long-term returns.
  • Health Care: Applied to sequential treatment planning, where each decision influences future patient states.
  • Energy Management: Helps schedule loads and control smart grids or building systems to reduce energy consumption and cost.
  • Education: Powers adaptive learning systems that decide which content or exercise to present to a student next.
  • Recommendation Systems: Learns which items to suggest so that long-term user engagement is maximized.
  • Resource Management: Allocates computing, network, or inventory resources under changing demand.
  • Space Travel: Studied for spacecraft control and trajectory-planning decisions.

Frequently Asked Questions (FAQs) on Q-Learning

Q. What is Q-learning?

Q-learning is a machine learning technique that allows an agent to learn iteratively and improve over time by being rewarded for making the right decisions. It is a form of reinforcement learning.

Q. What is Reinforcement Learning?

Reinforcement learning is a machine learning technique that uses feedback to teach an agent how to behave in a given environment by having it perform actions and observe the outcomes of those actions. The agent receives positive feedback (a reward) for each action that goes well and negative feedback (a penalty) for each action that goes wrong.

Q. What is a recommendations system?

A recommendation system is a software application that gives users recommendations or suggestions. These systems employ algorithms that examine user preferences and behaviour in order to recommend products, movies, or articles.

Q. Why is Q-learning used?

Q-learning is a reinforcement learning method used to determine the best course of action given the current state. It is off-policy: although the agent may occasionally explore by acting at random, the algorithm learns the action values that maximize the total expected reward.
