# Q-learning Mathematical Background

Last Updated : 18 Jun, 2019
Prerequisites: Q-Learning. In the following derivations, the symbols defined in the prerequisite article will be used.

The Q-learning technique is based on the Bellman Equation:

$$v(s) = \mathbb{E}\left[ R_{t+1} + \gamma \, v(S_{t+1}) \mid S_t = s \right]$$

where
$\mathbb{E}$ : expectation
$S_{t+1}$ : next state
$\gamma$ : discount factor

Rephrasing the above equation in the form of the Q-value:

$$Q^{\pi}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \, Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]$$

The optimal Q-value is given by:

$$Q^{*}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(S_{t+1}, a') \mid S_t = s, A_t = a \right]$$

Policy Iteration: It is the process of determining the optimal policy for the model and consists of the following two steps:
1. Policy Evaluation: This step estimates the long-term value function V under the greedy policy obtained from the last Policy Improvement step.
2. Policy Improvement: This step updates the policy with the action that maximizes V for each state. The two steps are repeated until convergence is achieved.
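The Bellman-based Q-value backup can be sketched in Python as a one-step temporal-difference update. The 2-state, 2-action table and the learning rate `alpha` below are hypothetical choices for illustration, not from the article:

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[s_next])    # optimal Bellman backup
    Q[s][a] += alpha * (target - Q[s][a])  # temporal-difference step
    return Q

# Hypothetical 2-state, 2-action table, initialised to zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0][1])  # 0.5 * (1.0 + 0.9 * 0.0 - 0.0) = 0.5
```

Repeating this update along sampled transitions is what drives the table toward the optimal Q-values.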
Steps Involved:
• Initialization: $V(s)$ = any random real number; $\pi(s)$ = any action in $A(s)$, chosen arbitrarily
• Policy Evaluation:

while($\Delta > \theta$)
{
    $\Delta = 0$
    for each s in S
    {
        $v = V(s)$
        $V(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s)) + \gamma V(s') \right]$
        $\Delta = \max(\Delta, |v - V(s)|)$
    }
}

Here $\theta$ is a small positive threshold that controls when the sweep is considered converged.


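The evaluation loop above can be sketched in Python. The deterministic 2-state MDP (its `P` and `R` tables) is a hypothetical example used only to exercise the sweep:

```python
def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Sweep all states, backing V(s) up with the Bellman expectation
    equation for the fixed policy, until the largest change is below theta."""
    n = len(P)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            a = policy[s]
            v_new = sum(P[s][a][s2] * (R[s][a] + gamma * V[s2]) for s2 in range(n))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Hypothetical MDP: action 0 stays put, action 1 jumps to the other state.
P = [[[1.0, 0.0], [0.0, 1.0]],     # P[s][a][s'] transition probabilities
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0], [0.0, 0.0]]       # R[s][a] immediate rewards
print(policy_evaluation(P, R, policy=[1, 0]))  # [1.0, 0.0]
```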
• Policy Improvement:

while(true)
{
    run the Policy Evaluation step for the current policy $\pi$
    policy-stable = true
    for each s in S
    {
        old-action = $\pi(s)$
        $\pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a) + \gamma V(s') \right]$
        if(old-action $\neq \pi(s)$)
            policy-stable = false
    }
    if(policy-stable)
        break from both loops
}
return $V$, $\pi$
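The full evaluate-then-improve cycle can be sketched in Python as below, stopping once no state changes its greedy action. The 2-state MDP tables are a hypothetical example, not from the article:

```python
def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Bellman-expectation sweeps for a fixed policy (evaluation step)."""
    n = len(P)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            a = policy[s]
            v_new = sum(P[s][a][s2] * (R[s][a] + gamma * V[s2]) for s2 in range(n))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def q_value(P, R, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(P[s][a][s2] * (R[s][a] + gamma * V[s2]) for s2 in range(len(P)))

def policy_iteration(P, R, gamma=0.9):
    n, m = len(P), len(P[0])
    policy = [0] * n                   # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        # Improvement: act greedily with respect to the evaluated V.
        new_policy = [max(range(m), key=lambda a: q_value(P, R, V, s, a, gamma))
                      for s in range(n)]
        if new_policy == policy:       # policy-stable: optimal, stop
            return V, policy
        policy = new_policy

# Hypothetical MDP: action 0 stays put, action 1 jumps to the other state;
# only the jump out of state 0 earns a reward.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0], [0.0, 0.0]]
V, pi = policy_iteration(P, R)
print(pi)  # [1, 1]: keep bouncing between states to collect the reward
```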

• Value Iteration: This process updates the function V according to the Optimal Bellman Equation: $V(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a) + \gamma V(s') \right]$
Working Steps:
• Initialization: Initialize the array V with arbitrary random real numbers.
• Computing the optimal value:

while($\Delta > \theta$)
{
    $\Delta = 0$
    for each s in S
    {
        $v = V(s)$
        $V(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a) + \gamma V(s') \right]$
        $\Delta = \max(\Delta, |v - V(s)|)$
    }
}

return $V$
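The value-iteration loop can be sketched in Python as follows; the 2-state MDP tables are a hypothetical example, chosen only to exercise the backup:

```python
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Back V(s) up with the optimal Bellman equation until the largest
    per-sweep change drops below theta."""
    n, m = len(P), len(P[0])
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            v_new = max(sum(P[s][a][s2] * (R[s][a] + gamma * V[s2])
                            for s2 in range(n))
                        for a in range(m))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Hypothetical MDP: action 0 stays put, action 1 jumps to the other state;
# only the jump out of state 0 is rewarded.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0], [0.0, 0.0]]
V = value_iteration(P, R)
```

Unlike policy iteration, there is no explicit policy during the sweeps; the greedy policy can be read off from the converged V at the end.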