**Prerequisites:** Q-Learning.

The derivations below use the symbols defined in the prerequisite article.

The Q-learning technique is based on the **Bellman Equation**:

$$v(s) = E\big[r_{t+1} + \gamma\, v(s_{t+1})\big]$$

where,

**E** : expectation
**t+1** : next state
**γ** : discount factor

Rephrasing the above equation in the form of the Q-value:

$$Q^{\pi}(s_t, a_t) = E\big[r_{t+1} + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1})\big]$$

The **optimal Q-value** is given by

$$Q^{*}(s_t, a_t) = E\big[r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a')\big]$$
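Q-learning approaches this optimal Q-value by repeatedly applying the Bellman update to sampled transitions. As a minimal sketch (the 4-state chain environment, reward of 1 at the goal, and all hyperparameters below are illustrative assumptions, not from the article):

```python
import random

random.seed(0)

N, GOAL = 4, 3                  # hypothetical chain of 4 states; goal at state 3
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
ACTIONS = (1, -1)               # step right / step left

def step(s, a):
    """Deterministic chain dynamics: reward 1 only on reaching the goal."""
    s2 = min(max(s + a, 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

for _ in range(500):            # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r, done = step(s, a)
        # Bellman-based update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = r + (0.0 if done else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

# greedy policy recovered from the learned Q-table
greedy = {s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)}
```

With enough episodes, the greedy policy moves right in every state and Q(2, +1) approaches the true optimal value of 1.0 (and Q(1, +1) approaches γ · 1.0 = 0.9).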

**Policy Iteration:** It is the process of determining the optimal policy for the model and consists of the following two steps:

- **Policy Evaluation:** This process estimates the value of the long-term reward function under the greedy policy obtained from the last Policy Improvement step.
- **Policy Improvement:** This process updates the policy with the action that maximizes V for each state. The two steps are repeated until convergence is achieved.
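A minimal Python sketch of these two alternating steps, on a hypothetical two-state MDP (the transition table `P` and all numbers below are illustrative assumptions, not from the article; `P[s][a]` lists `(probability, next_state, reward)` tuples):

```python
# Hypothetical 2-state, 2-action MDP: action 1 always moves to state 1,
# which pays reward 2 for staying; action 0 returns to state 0 with reward 0.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
STATES, ACTIONS = (0, 1), (0, 1)
GAMMA, THETA = 0.9, 1e-8

def q_value(P, V, s, a):
    """Expected return of taking a in s: sum over p(s',r|s,a) [r + gamma V(s')]."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def policy_iteration(P):
    V = {s: 0.0 for s in STATES}            # arbitrary initialization
    pi = {s: ACTIONS[0] for s in STATES}    # arbitrary initial policy
    while True:
        # Policy Evaluation: iterate V to convergence under the current policy
        while True:
            delta = 0.0
            for s in STATES:
                v = V[s]
                V[s] = q_value(P, V, s, pi[s])
                delta = max(delta, abs(v - V[s]))
            if delta < THETA:
                break
        # Policy Improvement: act greedily with respect to V
        stable = True
        for s in STATES:
            old = pi[s]
            pi[s] = max(ACTIONS, key=lambda a: q_value(P, V, s, a))
            if old != pi[s]:
                stable = False
        if stable:
            return V, pi
```

For this toy MDP the optimal policy chooses action 1 in both states, giving V(1) = 2 / (1 − 0.9) = 20 and V(0) = 1 + 0.9 · 20 = 19.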

**Steps Involved:**

**Initialization:** `V(s)` = any random real number and `π(s)` = any `a` in `A(s)`, chosen arbitrarily, for every state `s`.

**Policy Evaluation:**

```
while (Δ > θ):
    Δ = 0
    for each s in S:
        v = V(s)
        V(s) = Σ_{s', r} p(s', r | s, π(s)) [r + γ V(s')]
        Δ = max(Δ, |v - V(s)|)
```

**Policy Improvement:**

```
while (true):
    policy_stable = true
    for each s in S:
        old_action = π(s)
        π(s) = argmax_a Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
        if (old_action ≠ π(s)):
            policy_stable = false
    if (policy_stable):
        break from both loops    # otherwise, go back to Policy Evaluation
return V, π
```

**Value Iteration:** This process updates the function V according to the **Optimal Bellman Equation**:

$$v_{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma\, v_{*}(s')\big]$$

**Working Steps:**

**Initialization:** Initialize the array V with any random real number for every state.

**Computing the optimal value:**

```
while (Δ > θ):
    Δ = 0
    for each s in S:
        v = V(s)
        V(s) = max_a Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
        Δ = max(Δ, |v - V(s)|)
return V
```
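These working steps can be sketched in Python on a hypothetical two-state MDP (the transition table `P` and all numbers are illustrative assumptions, not from the article; `P[s][a]` lists `(probability, next_state, reward)` tuples):

```python
# Hypothetical 2-state, 2-action MDP: action 1 always moves to state 1,
# which pays reward 2 for staying; action 0 returns to state 0 with reward 0.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
STATES, ACTIONS = (0, 1), (0, 1)
GAMMA, THETA = 0.9, 1e-8

def value_iteration(P):
    # Initialization: V(s) set to an arbitrary value for every state
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = V[s]
            # Optimal Bellman update: V(s) = max_a Σ p(s',r|s,a) [r + γ V(s')]
            V[s] = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                       for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
        if delta < THETA:           # stop once V has converged
            return V
```

Unlike policy iteration, no explicit policy is maintained during the sweeps; a greedy policy can be read off the converged V at the end. On this toy MDP the loop converges to V(1) = 20 and V(0) = 19.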