
Q-learning Mathematical Background

Last Updated : 18 Jun, 2019
Prerequisites: Q-Learning. In the following derivations, the symbols are used as defined in the prerequisite article.

The Q-learning technique is based on the Bellman Equation:

v(s) = E(R_{t+1}+\lambda v(S_{t+1})|S_{t}=s)

where,
E : expectation
t+1 : next time step (so S_{t+1} is the next state)
\lambda : discount factor

Rephrasing the above equation in the form of Q-values:

Q^{\pi}(s,a) = E(r_{t+1}+\lambda r_{t+2}+\lambda ^{2}r_{t+3}+...|S_{t}=s,A_{t}=a)
             = E_{s'}(r_{t+1}+\lambda Q^{\pi}(s',a')|S_{t}=s,A_{t}=a)

The optimal Q-value is given by:

Q^{*}(s,a) = E_{s'}(r_{t+1}+\lambda max_{a'}Q^{*}(s',a')|S_{t}=s,A_{t}=a)

Policy Iteration: It is the process of determining the optimal policy for the model and consists of the following two steps (a sketch of the sampled Q-learning update that this recursion leads to is given after the step list below):
  1. Policy Evaluation: This step estimates the long-term value function V under the greedy policy obtained from the last Policy Improvement step.
  2. Policy Improvement: This step updates the policy with the action that maximizes V for each state. The two steps are repeated alternately until the policy converges.
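In practice, the optimal Q-value recursion above is approximated from sampled transitions by the tabular Q-learning update. The following is a minimal Python sketch of that update; the table shape, the learning rate alpha and the sample transition at the end are illustrative assumptions, not part of the derivation above.

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, lam=0.9):
        # Tabular Q-learning step derived from the optimal Bellman equation:
        # Q(s,a) <- Q(s,a) + alpha * (r + lambda * max_a' Q(s',a') - Q(s,a))
        td_target = r + lam * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
        return Q

    # Hypothetical usage: a 5-state, 2-action Q-table and one sampled transition
    Q = np.zeros((5, 2))
    Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)

The learning rate alpha controls how far each sampled transition moves the stored estimate toward the bootstrapped target.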
Steps Involved (a Python sketch of the complete policy-iteration loop follows these steps):-
  • Initialization: V(s) = any random real number; \pi(s) = any action from A(s), chosen arbitrarily
  • Policy Evaluation:
    do
    {
        \Delta = 0
        for each s in S
        {
            v = V(s)
            V(s) = \sum_{s',r}(p(s',r|s,\pi (s))(r+\lambda V(s')))
            \Delta = max(\Delta ,|v-V(s)|)
        }
    } while(\Delta > \theta)
    
    where \theta is a small positive convergence threshold, \theta \rightarrow 0^{+}
    
    
  • Policy Improvement:
    isPolicyStable = true
    for each s in S
    {
        a = \pi (s)
        \pi (s) = argmax_{a}\sum _{s',r}(p(s',r|s,a)(r+\lambda V(s')))
        if(a \neq \pi (s))
            isPolicyStable = false
    }
    if(isPolicyStable == true)
        return V,\pi
    else
        repeat from Policy Evaluation
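The two steps above can be combined into a single loop. Below is a minimal Python sketch of that loop; the transition-model format P[s][a] = list of (probability, next_state, reward) tuples, the names n_states and n_actions, and the threshold theta are assumptions made for this example.

    import numpy as np

    def policy_iteration(P, n_states, n_actions, lam=0.9, theta=1e-8):
        # P[s][a] is assumed to be a list of (prob, next_state, reward) tuples.
        V = np.zeros(n_states)
        pi = np.zeros(n_states, dtype=int)      # arbitrary initial policy
        while True:
            # Policy Evaluation: sweep until the largest value change is below theta
            while True:
                delta = 0.0
                for s in range(n_states):
                    v = V[s]
                    V[s] = sum(p * (r + lam * V[s2]) for p, s2, r in P[s][pi[s]])
                    delta = max(delta, abs(v - V[s]))
                if delta < theta:
                    break
            # Policy Improvement: make the policy greedy with respect to V
            policy_stable = True
            for s in range(n_states):
                old_a = pi[s]
                q = [sum(p * (r + lam * V[s2]) for p, s2, r in P[s][a])
                     for a in range(n_actions)]
                pi[s] = int(np.argmax(q))
                if old_a != pi[s]:
                    policy_stable = False
            if policy_stable:
                return V, pi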
    
Value Iteration: This process updates the function V according to the Optimal Bellman Equation:

v_{*}(s) = max_{a}E(R_{t+1}+\lambda v_{*}(S_{t+1})|S_{t}=s,A_{t}=a)
Working Steps (a Python sketch of value iteration follows these steps):
  • Initialization: Initialize the array V with arbitrary random real numbers.
  • Computing the optimal value:
    do
    {
        \Delta = 0
        for each s in S
        {
            v = V(s)
            V(s) = max_{a}\sum_{s',r}(p(s',r|s,a)(r+\lambda V(s')))
            \Delta = max(\Delta ,|v-V(s)|)
        }
    } while(\Delta > \theta)
    
    \pi (s) = argmax_{a}\sum _{s',r}(p(s',r|s,a)(r+\lambda V(s')))
    return \pi
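Below is a minimal Python sketch of the value-iteration procedure above, using the same assumed transition-model format P[s][a] = list of (probability, next_state, reward) tuples as the policy-iteration sketch; all names are illustrative assumptions.

    import numpy as np

    def value_iteration(P, n_states, n_actions, lam=0.9, theta=1e-8):
        # P[s][a] is assumed to be a list of (prob, next_state, reward) tuples.
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                # Optimal Bellman backup: maximize over actions
                V[s] = max(sum(p * (r + lam * V[s2]) for p, s2, r in P[s][a])
                           for a in range(n_actions))
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # Extract the greedy policy from the converged values
        pi = [int(np.argmax([sum(p * (r + lam * V[s2]) for p, s2, r in P[s][a])
                             for a in range(n_actions)]))
              for s in range(n_states)]
        return V, pi

Both sketches recover the same greedy policy on a given MDP; value iteration simply folds the maximization over actions directly into the value update instead of alternating evaluation and improvement steps.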
    

