
# Q-learning Mathematical Background

• Last Updated : 18 Jun, 2019

Prerequisites: Q-Learning.

In the following derivations, the symbols are used as defined in the prerequisite article.
The Q-learning technique is based on the Bellman Equation:

$$v(s) = E\left[\,R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s\,\right]$$

where,

• $E$ : expectation
• $t+1$ : next time step (so $S_{t+1}$ is the next state)
• $\gamma$ : discount factor
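When the transition model is known, the expectation in the Bellman equation can be computed exactly as a weighted sum over outcomes. The sketch below illustrates a single Bellman expectation backup; the two-state MDP (`model`, `gamma`) is a hypothetical example, not from the article:

```python
# One Bellman expectation backup V(s) = E[R_{t+1} + gamma * V(S_{t+1})]
# under a fixed policy, on a hypothetical 2-state MDP.

gamma = 0.9  # discount factor

# model[s] -> list of (probability, next_state, reward) under the policy
model = {
    0: [(0.8, 1, 5.0), (0.2, 0, 0.0)],
    1: [(1.0, 0, 1.0)],
}

def bellman_backup(V, s):
    """Expected one-step return from state s:
    sum over outcomes of p * (r + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in model[s])

V = {0: 0.0, 1: 0.0}
print(bellman_backup(V, 0))  # 0.8*(5.0 + 0) + 0.2*(0.0 + 0) = 4.0
```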

Rephrasing the above equation in the form of the Q-value:

$$Q^{\pi}(s, a) = E\left[\,R_{t+1} + \gamma\, Q^{\pi}(S_{t+1}, A_{t+1})\,\right]$$

The optimal Q-value is given by:

$$Q^{*}(s, a) = E\left[\,R_{t+1} + \gamma\, \max_{a'} Q^{*}(S_{t+1}, a')\,\right]$$

Policy Iteration: It is the process of determining the optimal policy for the model, and it consists of the following two steps:

1. Policy Evaluation: This process estimates the value of the long-term reward function with the greedy policy obtained from the last Policy Improvement step.
2. Policy Improvement: This process updates the policy with the action that maximizes V for each state. These two steps are repeated until convergence is achieved.
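Assuming a known transition model (which policy iteration requires), the two steps above can be sketched in Python. The toy MDP `P`, the discount `gamma`, and the threshold `theta` are hypothetical illustrations, not values from the article:

```python
# Policy iteration sketch: alternate policy evaluation and greedy
# policy improvement until the policy stops changing.

gamma, theta = 0.9, 1e-8

# P[s][a] -> list of (probability, next_state, reward); toy 2-state MDP
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
states = list(P)
actions = {s: list(P[s]) for s in P}

def evaluate(policy, V):
    # Iterative policy evaluation: sweep until the change Delta < theta.
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

def improve(policy, V):
    # Greedy improvement: pick the action maximizing the one-step lookahead.
    stable = True
    for s in states:
        old = policy[s]
        policy[s] = max(actions[s], key=lambda a: sum(
            p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        if policy[s] != old:
            stable = False
    return stable

policy = {s: actions[s][0] for s in states}
V = {s: 0.0 for s in states}
while True:
    V = evaluate(policy, V)
    if improve(policy, V):
        break
print(policy)  # prints {0: 1, 1: 1}: action 1 is greedy-optimal in both states
```

On this toy MDP, staying in state 1 yields reward 2 forever, so the improved policy chooses action 1 everywhere and $V(1) = 2/(1-\gamma) = 20$.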

Steps Involved:

• Initialization: $V(s)$ = any random real number, $\pi(s)$ = any $a \in A(s)$, arbitrarily chosen, for all $s \in S$

• Policy Evaluation:

  $\Delta = \infty$
  while( $\Delta \geq \theta$ )
  {
      $\Delta = 0$
      for each s in S
      {
          $v = V(s)$
          $V(s) = \sum_{s', r} p(s', r \mid s, \pi(s))\,[\,r + \gamma\, V(s')\,]$
          $\Delta = \max(\Delta,\ |v - V(s)|)$
      }
  }

• Policy Improvement:

  while(true)
  {
      policy-stable = true
      for each s in S
      {
          old-action = $\pi(s)$
          $\pi(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma\, V(s')\,]$
          if( old-action $\neq \pi(s)$ )
              policy-stable = false
      }
      if( policy-stable == true )
          break
  }
  return V, $\pi$

• Value Iteration: This process updates the function V according to the Optimal Bellman Equation:

$$V^{*}(s) = \max_{a} E\left[\,R_{t+1} + \gamma\, V^{*}(S_{t+1})\,\right]$$

Working Steps:

• Initialization: Initialize the array V with arbitrary random real numbers.
• Computing the optimal value:

  $\Delta = \infty$
  while( $\Delta \geq \theta$ )
  {
      $\Delta = 0$
      for each s in S
      {
          $v = V(s)$
          $V(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma\, V(s')\,]$
          $\Delta = \max(\Delta,\ |v - V(s)|)$
      }
  }
  return V
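The value-iteration loop above can be sketched as follows. The toy MDP `P`, the discount `gamma`, and the threshold `theta` are assumed for illustration and are not from the article:

```python
# Value iteration: repeatedly apply the optimal Bellman backup
# V(s) <- max_a sum_{s',r} p(s',r|s,a) * (r + gamma * V(s'))
# until the largest per-sweep change Delta drops below theta.

gamma, theta = 0.9, 1e-8

# P[s][a] -> list of (probability, next_state, reward); toy 2-state MDP
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def value_iteration():
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = V[s]
            V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                       for outcomes in P[s].values())
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

V = value_iteration()
print(V)  # approximately {0: 19.0, 1: 20.0}
```

Unlike policy iteration, no explicit policy is maintained during the sweeps; a greedy policy can be read off from the converged V at the end.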
