
REINFORCE Algorithm


REINFORCE is a reinforcement learning algorithm that adjusts the parameters of a policy (often represented by a neural network) after each complete episode. It is a Monte Carlo variant of the policy gradient approach. This article covers the fundamentals of the REINFORCE algorithm and walks through a simple implementation.

Basics of Reinforcement Learning

Reinforcement Learning is a machine learning paradigm that trains an agent by rewarding it for good actions and penalizing it for bad ones.

Important Terms of Reinforcement Learning

Let's assume we are teaching a robot to play a game; the robot is our agent. The environment is the game world, including the characters, obstacles and everything else the robot interacts with. The actions are the moves or decisions the robot can make, such as going left or right. The reward is the points gained or lost as a result of the agent's actions.
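
To make these terms concrete, here is a minimal sketch of the agent-environment interaction loop. The one-dimensional "game" below is made up purely for illustration, not any real environment:

Python

import random

# Made-up toy "game": the agent starts at position 0 and wins at position +3.
position = 0                                      # state of the environment
for step in range(10):
    action = random.choice(["left", "right"])     # the agent's decision
    position += 1 if action == "right" else -1    # the environment reacts
    reward = 1 if position == 3 else 0            # points for reaching the goal
    print(step, action, position, reward)
    if position == 3:                             # the episode ends at the goal
        break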

Policy Gradient

The policy gradient method focuses on learning a policy directly – a strategy, or set of rules, guiding the agent's decision-making. The policy is represented by a parameterized function, such as a neural network, that takes the state of the environment as input and outputs a probability distribution over the possible actions.
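
For illustration, here is a minimal sketch of such a parameterized policy. The linear-softmax form and the shapes below are assumptions for the example, not the article's exact setup:

Python

import numpy as np

# A minimal parameterized policy: a linear layer followed by softmax.
# `theta` (state_dim x num_actions) is what the policy gradient method learns.
def policy(state, theta):
    preferences = state @ theta                 # one score per action
    preferences -= preferences.max()            # numerical stability
    exp_prefs = np.exp(preferences)
    return exp_prefs / exp_prefs.sum()          # probability distribution over actions

state = np.array([0.2, -1.0, 0.5])              # example 3-dimensional state
theta = np.zeros((3, 2))                        # 3 state features, 2 actions
print(policy(state, theta))                     # -> [0.5 0.5] before any learning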

Monte Carlo Methods

The expected reward is estimated using Monte Carlo methods: the agent samples complete sequences of states, actions, and rewards (episodes) and uses them to update the policy.
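
For example, the expected return of a stochastic policy can be estimated by averaging many sampled returns. In the sketch below, `sample_episode_return` is a hypothetical stand-in for running one full episode; its reward statistics are made up:

Python

import numpy as np

# Hypothetical stand-in for one full episode: returns one noisy sampled return.
def sample_episode_return():
    return np.random.normal(loc=1.0, scale=0.5)   # made-up return statistics

# Monte Carlo estimate: average many sampled returns.
samples = [sample_episode_return() for _ in range(10000)]
print("Estimated expected return:", np.mean(samples))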

What is REINFORCE Algorithm?

The REINFORCE algorithm was introduced by Ronald J. Williams in 1992. Its aim is to maximize the expected cumulative reward by adjusting the policy parameters. REINFORCE is used to train agents to make sequential decisions in an environment. It is a policy gradient method that belongs to the family of Monte Carlo algorithms. In REINFORCE, a neural network is typically employed to represent the policy, a strategy guiding the agent's actions in different states.

The algorithm updates the neural network’s parameters based on the obtained rewards, aiming to enhance the likelihood of actions that lead to higher cumulative rewards. This is an iterative process that allows the agent to learn a policy for decision-making in the given environment.

REward Increment = Non-negative Factor × Offset Reinforcement × Characteristic Eligibility
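
In symbols, the basic (no-baseline) version of this update can be written as

\theta_{t+1} = \theta_t + \alpha \, G_t \, \nabla_\theta \ln \pi(A_t|S_t, \theta_t)

where the learning rate \alpha is the non-negative factor, the return G_t plays the role of the (offset) reinforcement, and \nabla_\theta \ln \pi(A_t|S_t, \theta_t), the gradient of the log-probability of the chosen action, is its characteristic eligibility. This is the special case b(S_t) = 0 of the baseline update derived later in the article.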

Algorithm

  1. Initialize the policy parameters: Start with an initial policy, typically represented by a parametric function such as a neural network.
  2. Collect trajectories: Execute the current policy in the environment to gather a set of trajectories, i.e., sequences of states, actions, and rewards.
  3. Compute returns: For each state in a trajectory, compute the return, the sum of the discounted rewards from that state onward.
  4. Compute the policy gradient: Determine the gradient of the expected return with respect to the policy parameters. This requires the gradient of the log-probability of the chosen actions, weighted by the returns.
  5. Update the policy parameters: Adjust the parameters so that actions leading to higher returns become more likely, usually via gradient ascent.
  6. Repeat: Iterate steps 2 through 5 for many episodes.

REINFORCE with Baseline

The policy gradient theorem in the context of episodic scenarios states that:

 \nabla J(\theta) \propto \sum_{s}{\mu(s)}\sum_{a}{q_\pi(s,a)\nabla {\pi}(a|s, \theta)}

Here,

  • the gradients are column vectors of partial derivatives with respect to the components of \theta
  • \pi denotes the policy corresponding to the parameter vector \theta
  • the distribution \mu is the on-policy distribution under \pi

The policy gradient theorem can be extended to incorporate a comparison between the action value and a user-defined baseline, denoted as b(s):

 \nabla J(\theta) \propto \sum_{s}{\mu(s)}\sum_{a}{(q_\pi(s,a)-b(s))\nabla\pi(a|s, \theta)}

The baseline can be any function, or even a random variable, as long as it does not vary with the action a; the equation remains valid because the subtracted quantity is zero:

\sum_{a}{b(s)}\nabla \pi(a|s,\theta) = b(s) \nabla \sum_{a} \pi (a|s, \theta) = b(s) \nabla 1 = 0

The policy gradient theorem with a baseline can be used to derive an update rule by following the same steps as for plain REINFORCE. The result is a generalized version of REINFORCE that incorporates an arbitrary baseline:

\theta_{t+1} = \theta_t + \alpha(G_t -b(S_t))\frac{\nabla \pi (A_t|S_t, \theta_t)}{\pi(A_t|S_t, \theta_t)}

Because the baseline can be identically zero, this update is a strict generalization of REINFORCE. In general, the baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance.
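
As a minimal sketch of this update (assuming a tabular softmax policy like the one in the implementation below, and a simple running average of past returns as the baseline b rather than a learned state-value function):

Python

import numpy as np

def softmax(preferences):
    exp_prefs = np.exp(preferences - preferences.max())
    return exp_prefs / exp_prefs.sum()

# One REINFORCE-with-baseline update for a single time step. The baseline is
# state-independent here (a running average of returns), which is a valid b(s).
# Note that grad pi / pi equals grad log pi.
def reinforce_with_baseline_step(theta, action, G, baseline, alpha=0.01):
    probs = softmax(theta)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0                      # gradient of log-softmax
    return theta + alpha * (G - baseline) * grad_log_pi

theta = np.zeros(3)
baseline = 0.0
for G, action in [(1.0, 2), (0.2, 0), (0.8, 2)]:    # made-up (return, action) pairs
    theta = reinforce_with_baseline_step(theta, action, G, baseline)
    baseline = 0.9 * baseline + 0.1 * G             # update the running average
print(theta)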

Implementation of REINFORCE Algorithm

Let's look at a simple scenario in which an agent learns to play a game. The policy could be a neural network that outputs the probabilities of taking each action. By playing the game, the agent collects trajectories, computes returns, uses the policy gradient to update the policy parameters, and then repeats the procedure. The simplified example below replaces the neural network with a tabular softmax policy (one preference parameter per action) but follows exactly these steps.

Python

import numpy as np

# Per-action rewards of the toy environment (action 1 pays the most);
# the numbers mirror the fixed rewards used in the original sketch.
ACTION_REWARDS = [0.1, 0.5, 0.2]

# Initialization: one preference parameter per action (tabular softmax policy)
def initialize_policy(num_actions):
    return np.random.rand(num_actions)

def softmax(preferences):
    # Numerically stable softmax turning preferences into action probabilities
    exp_prefs = np.exp(preferences - np.max(preferences))
    return exp_prefs / exp_prefs.sum()

def collect_trajectory(policy_parameters, episode_length=3):
    # Sample actions from the current policy; for simplicity the environment
    # has no meaningful state and simply pays a fixed reward per action
    probs = softmax(policy_parameters)
    actions, rewards = [], []
    for _ in range(episode_length):
        action = np.random.choice(len(probs), p=probs)
        actions.append(action)
        rewards.append(ACTION_REWARDS[action])
    return actions, rewards

def compute_returns(rewards, gamma=0.99):
    # Discounted return G_t = r_t + gamma * G_{t+1}, computed backwards
    returns = np.zeros(len(rewards))
    running_return = 0.0
    for t in reversed(range(len(rewards))):
        running_return = rewards[t] + gamma * running_return
        returns[t] = running_return
    return returns

def compute_policy_gradient(policy_parameters, actions, returns):
    # REINFORCE gradient: sum over time steps of G_t * grad log pi(a_t)
    probs = softmax(policy_parameters)
    gradient = np.zeros_like(policy_parameters)
    for action, G in zip(actions, returns):
        grad_log_pi = -probs.copy()
        grad_log_pi[action] += 1.0   # gradient of log-softmax w.r.t. preferences
        gradient += G * grad_log_pi
    return gradient

def update_policy_parameters(policy_parameters, policy_gradient, learning_rate=0.01):
    # Gradient ascent step: move the parameters towards higher expected return
    return policy_parameters + learning_rate * policy_gradient

# Initialization
num_actions = 3
policy_parameters = initialize_policy(num_actions)

# Training loop
num_episodes = 1000
for episode in range(num_episodes):
    # Collect a trajectory with the current policy
    actions, rewards = collect_trajectory(policy_parameters)

    # Compute discounted returns
    returns = compute_returns(rewards)

    # Compute the policy gradient
    policy_gradient = compute_policy_gradient(policy_parameters, actions, returns)

    # Update the policy parameters
    policy_parameters = update_policy_parameters(policy_parameters, policy_gradient)

# Final policy
print("Final Policy Parameters:", policy_parameters)
print("Final Action Probabilities:", softmax(policy_parameters))

                    

Output:

The script prints the final policy parameters and the corresponding action probabilities. The exact values vary from run to run because the initialization and the action sampling are random, but after training the probability of action 1, the action with the highest reward, should be the largest.

