
Bellman Equation

Last Updated : 27 Sep, 2021

According to the Bellman equation, the long-term reward for taking a given action equals the immediate reward from that action plus the expected reward from the actions taken at the following time steps. Let’s first try to understand what this means.

Let’s take an example:

Here we have a maze, which is our environment. The sole goal of our agent is to reach the trophy state (R = 1), which gives a good reward, and to avoid the fire state (R = -1), which counts as a failure and gives a bad reward.

Fig: Without Bellman Equation

What happens without Bellman Equation?

Initially, we give our agent some time to explore the environment and let it figure out a path to the goal. As soon as it reaches the goal, it traces its steps back to the starting position and marks every state that eventually leads to the goal with V = 1.
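The back-tracing step can be sketched in a few lines of Python. This is only an illustration: the grid coordinates, the explored path and the helper name mark_path are hypothetical and do not come from the figure.

```python
def mark_path(path):
    """Back-trace from the goal and mark every state on the path with V = 1."""
    return {state: 1.0 for state in reversed(path)}

# Hypothetical path found during exploration, as (row, column) cells.
explored_path = [(2, 0), (1, 0), (0, 0), (0, 1), (0, 2)]
V = mark_path(explored_path)

# Standing on (0, 0), the agent looks at its two neighbours on the path.
neighbours = [(1, 0), (0, 1)]
distinct_values = {V[n] for n in neighbours}
print(distinct_values)
```

Both neighbours carry the same value, so the numbers alone do not say which one is closer to the goal.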

The agent faces no problem until we change its starting position. From a new start it cannot find a path to the trophy state, because every marked state has the same value of 1 and the values give no indication of which direction leads to the goal. To solve this problem, we use the Bellman equation:

$$V(s) = \max_{a} \left( R(s,a) + \gamma V(s') \right)$$


State (s): the current state of the agent in the environment.

Next state (s’): the state the agent reaches after taking action a in state s.

Value (V): a numeric representation of a state that helps the agent find its path. V(s) denotes the value of state s.

Reward (R): the feedback the agent receives after performing an action a.

  • R(s): reward for being in state s
  • R(s,a): reward for being in state s and performing action a
  • R(s,a,s’): reward for being in state s, taking action a and ending up in s’

e.g. Good reward can be +1, Bad reward can be -1, No reward can be 0.

Action (a): one of the possible actions the agent can take in state s, e.g. LEFT, RIGHT, UP, DOWN.

Discount factor (γ): determines how much the agent cares about rewards in the distant future relative to those in the immediate future. It takes a value between 0 and 1. A lower value favors short-term rewards, while a higher value gives more weight to long-term rewards.
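To see the effect of the discount factor, here is a minimal sketch that compares two reward sequences under a low and a high γ. The sequences, the γ values and the helper discounted_return are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its time step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical reward sequences (not taken from the maze example).
small_now = [0.2, 0.0, 0.0, 0.0, 0.0]    # small reward immediately
large_later = [0.0, 0.0, 0.0, 0.0, 1.0]  # larger reward four steps later

for gamma in (0.5, 0.9):
    now = discounted_return(small_now, gamma)
    later = discounted_return(large_later, gamma)
    preferred = "small_now" if now > later else "large_later"
    print(f"gamma={gamma}: small_now={now:.4f}, "
          f"large_later={later:.4f}, preferred={preferred}")
```

With γ = 0.5 the delayed reward is worth only 0.5^4 = 0.0625, so the immediate reward wins. With γ = 0.9 it is worth 0.9^4 = 0.6561, so waiting pays off.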

Fig: Using Bellman Equation

The max selects the best of all the actions the agent can take in a particular state: the one whose immediate reward plus discounted next-state value is highest. Repeating this choice at every consecutive step leads the agent to the reward.

For example:

  • The state to the left of the fire state (V = 0.9) can go UP, DOWN or RIGHT but not LEFT, because there is a wall there (not accessible). Among the available actions, UP gives the maximum value for that state.
  • From its current starting state, our agent can choose either UP or RIGHT at random, since both lead to the reward in the same number of steps.
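The first bullet can be checked with a single application of the equation. The sketch below assumes γ = 0.9, a reward of 0 for ordinary moves, and neighbouring values of 1 (above) and 0.81 (below). None of these numbers is stated in the text; they are chosen to be consistent with the V = 0.9 quoted for this state. The helper bellman_backup is a hypothetical name.

```python
GAMMA = 0.9  # assumed discount factor

def bellman_backup(options, gamma=GAMMA):
    """Apply V(s) = max_a (R(s,a) + gamma * V(s')) to one state.

    options maps each available action to (immediate reward, value of next state).
    Returns the best action, its value, and the value of every action.
    """
    q = {a: r + gamma * v_next for a, (r, v_next) in options.items()}
    best_action = max(q, key=q.get)
    return best_action, q[best_action], q

# State to the left of the fire state; LEFT is missing because of the wall.
options = {
    "UP": (0.0, 1.0),      # assumed: the state above has V = 1
    "DOWN": (0.0, 0.81),   # assumed: the state below has V = 0.81
    "RIGHT": (-1.0, 0.0),  # stepping into the fire: R = -1, terminal V = 0
}

best_action, value, q = bellman_backup(options)
print(best_action, round(value, 3), q)
```

Under these assumptions UP scores 0.9, DOWN scores 0.729 and RIGHT scores -1, so the max picks UP and the state receives V = 0.9.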

Using the Bellman equation, our agent calculates the value of every state except the trophy and fire states. These two are assigned V = 0: they end the maze, so no further reward can be collected from them.

So, after making such a plan our agent can easily accomplish its goal by just following the increasing values.
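The whole procedure can be sketched by applying the equation repeatedly to every state until the values stop changing (often called value iteration). The maze layout below is an assumption: a 3 x 4 grid with the trophy in the top-right corner, the fire directly below it and a single wall cell. The discount factor 0.9, the zero reward for ordinary moves and all function names are likewise hypothetical.

```python
GAMMA = 0.9  # assumed discount factor
ROWS, COLS = 3, 4
TROPHY, FIRE, WALL = (0, 3), (1, 3), (1, 1)  # assumed layout, (row, column)
TERMINAL = {TROPHY, FIRE}
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

def available_actions(state):
    """Map each legal action to its next state; walls and edges are not accessible."""
    r, c = state
    result = {}
    for name, (dr, dc) in MOVES.items():
        nxt = (r + dr, c + dc)
        if 0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS and nxt != WALL:
            result[name] = nxt
    return result

def reward(next_state):
    """R(s, a): +1 for entering the trophy, -1 for the fire, 0 otherwise."""
    if next_state == TROPHY:
        return 1.0
    if next_state == FIRE:
        return -1.0
    return 0.0

def value_iteration(tol=1e-9):
    """Sweep all non-terminal states with the Bellman equation until stable."""
    states = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != WALL]
    V = {s: 0.0 for s in states}  # terminal states keep V = 0
    while True:
        delta = 0.0
        for s in states:
            if s in TERMINAL:
                continue
            best = max(reward(n) + GAMMA * V[n]
                       for n in available_actions(s).values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

def greedy_path(V, start):
    """From start, repeatedly take the action with the highest R + gamma * V."""
    path, s = [start], start
    while s not in TERMINAL:
        s = max(available_actions(s).values(),
                key=lambda n: reward(n) + GAMMA * V[n])
        path.append(s)
    return path

V = value_iteration()
path = greedy_path(V, start=(2, 0))
print({s: round(v, 2) for s, v in V.items()})
print(path)
```

The path is chosen by comparing R + γV rather than V alone, because the trophy itself has V = 0 and the final step is rewarded through R. Since the values now differ from state to state, the same table works from any starting position.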
