# Q-learning Mathematical Background

• Difficulty Level : Hard
• Last Updated : 18 Jun, 2019

Prerequisites: Q-Learning.

In the following derivations, the symbols are used as defined in the prerequisite article.
The Q-learning technique is based on the Bellman Equation:

V(s) = E[ R_{t+1} + γ·V(S_{t+1}) | S_t = s ]

where,
E : expectation
R_{t+1} : reward received at the next time step
S_{t+1} : next state
γ : discount factor

Rephrasing the above equation in the form of the Q-value:

Q^π(s, a) = E[ R_{t+1} + γ·Q^π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]

The optimal Q-value is given by

Q*(s, a) = E[ R_{t+1} + γ·max_{a'} Q*(S_{t+1}, a') | S_t = s, A_t = a ]
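As a small sketch, the one-step Q-learning update derived from the optimal Bellman equation can be written as follows. The state/action counts, learning rate α, and the sample transition are illustrative assumptions, not values from the article:

```python
# Hypothetical toy setup (illustrative only): 3 states, 2 actions.
n_states, n_actions = 3, 2
Q = [[0.0] * n_actions for _ in range(n_states)]  # Q-table initialized to zero
alpha, gamma = 0.5, 0.9  # learning rate and discount factor (assumed values)

def q_update(Q, s, a, r, s_next):
    """One-step Q-learning update toward the optimal Bellman target."""
    target = r + gamma * max(Q[s_next])    # r + γ·max_a' Q(s', a')
    Q[s][a] += alpha * (target - Q[s][a])  # move Q(s, a) toward the target

# Apply one update for a sampled transition (s=0, a=1, r=1.0, s'=2).
q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

With an all-zero table, the target is 1.0 + 0.9·0 = 1.0, so Q[0][1] moves halfway there (α = 0.5), ending at 0.5.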

Policy Iteration: It is the process of determining the optimal policy for the model and consists of the following two steps:

1. Policy Evaluation: This process estimates the value of the long-term reward function under the greedy policy obtained from the last Policy Improvement step.
2. Policy Improvement: This process updates the policy with the action that maximizes V for each state. These two steps are repeated until convergence is achieved.

Steps Involved:

• Initialization:

V(s) = any real random number, for all s in S
π(s) = any action in A(s), chosen arbitrarily

• Policy Evaluation:

while(Δ ≥ θ)    // θ is a small positive threshold
{
    Δ = 0
    for each s in S
    {
        v = V(s)
        V(s) = Σ_{s', r} p(s', r | s, π(s)) · [ r + γ·V(s') ]
        Δ = max(Δ, |v − V(s)|)
    }
}

• Policy Improvement:

while(true)
{
    policy_stable = true
    for each s in S
    {
        old_action = π(s)
        π(s) = argmax_a Σ_{s', r} p(s', r | s, a) · [ r + γ·V(s') ]
        if(old_action ≠ π(s))
            policy_stable = false
    }
    if(policy_stable)
        break from both loops
}
return V, π
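The steps above can be sketched on a tiny made-up MDP. The transition table, rewards, discount factor, and threshold below are all illustrative assumptions chosen only to make the loop runnable:

```python
# Tiny hypothetical MDP (illustrative assumption): 2 states, 2 actions,
# deterministic transitions P[s][a] = next state, rewards R[s][a].
P = [[0, 1], [0, 1]]
R = [[0.0, 1.0], [2.0, 0.0]]
gamma, theta = 0.9, 1e-8   # discount factor and evaluation threshold (assumed)
S, A = range(2), range(2)

V = [0.0, 0.0]             # arbitrary initial values
pi = [0, 0]                # arbitrary initial policy

while True:
    # Policy Evaluation: sweep V until it converges under the current policy.
    while True:
        delta = 0.0
        for s in S:
            v = V[s]
            V[s] = R[s][pi[s]] + gamma * V[P[s][pi[s]]]
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Policy Improvement: make the policy greedy with respect to V.
    stable = True
    for s in S:
        old = pi[s]
        pi[s] = max(A, key=lambda a: R[s][a] + gamma * V[P[s][a]])
        if pi[s] != old:
            stable = False
    if stable:
        break
```

In this toy model the optimal behavior is to hop between the two states (π = [1, 0]), and the loop terminates once the greedy policy stops changing.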

• Value Iteration: This process updates the function V according to the Optimal Bellman Equation:

V(s) = max_a E[ R_{t+1} + γ·V(S_{t+1}) | S_t = s, A_t = a ]

Working Steps:

• Initialization: Initialize the array V with any random real numbers.
• Computing the optimal value:

while(Δ ≥ θ)    // θ is a small positive threshold
{
    Δ = 0
    for each s in S
    {
        v = V(s)
        V(s) = max_a Σ_{s', r} p(s', r | s, a) · [ r + γ·V(s') ]
        Δ = max(Δ, |v − V(s)|)
    }
}

return a deterministic policy π such that π(s) = argmax_a Σ_{s', r} p(s', r | s, a) · [ r + γ·V(s') ]
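Value iteration can be sketched in the same way. As before, the tiny MDP, discount factor, and threshold are assumptions made up for illustration:

```python
# Tiny hypothetical MDP (illustrative assumption): 2 states, 2 actions,
# deterministic transitions P[s][a] = next state, rewards R[s][a].
P = [[0, 1], [0, 1]]
R = [[0.0, 1.0], [2.0, 0.0]]
gamma, theta = 0.9, 1e-8   # discount factor and stopping threshold (assumed)

V = [0.0, 0.0]
while True:
    delta = 0.0
    for s in range(2):
        v = V[s]
        # Optimal Bellman update: V(s) = max_a [ R(s, a) + γ·V(s') ]
        V[s] = max(R[s][a] + gamma * V[P[s][a]] for a in range(2))
        delta = max(delta, abs(v - V[s]))
    if delta < theta:
        break

# Recover the deterministic greedy policy from the converged V.
pi = [max(range(2), key=lambda a: R[s][a] + gamma * V[P[s][a]]) for s in range(2)]
```

Unlike policy iteration, there is no inner evaluation loop: each sweep applies the max over actions directly, and the greedy policy is extracted once V has converged.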

