Q-learning Mathematical Background

Prerequisites: Q-Learning.

In the following derivations, the symbols defined in the prerequisite article will be used.
The Q-learning technique is based on the Bellman Equation.

v(s) = E(R_{t+1}+\gamma v(S_{t+1})|S_{t}=s)
where,
E : expectation
t+1 : the next time step
\gamma : discount factor



Rewriting the above equation in terms of the Q-value:

Q^{\pi}(s,a) = E(r_{t+1}+\gamma r_{t+2}+\gamma ^{2}r_{t+3}+...|S_{t}=s,A_{t}=a)

= E_{s'}(r_{t+1}+\gamma Q^{\pi}(s',a')|S_{t}=s,A_{t}=a)

The optimal Q-value is given by

Q^{*}(s,a) = E_{s'}(r_{t+1}+\gamma max_{a'}Q^{*}(s',a')|S_{t}=s,A_{t}=a)
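
The optimal Bellman equation above is exactly what tabular Q-learning approximates from sampled transitions: the quantity r + \gamma max_{a'}Q(s',a') is used as the update target. Below is a minimal Python sketch of this update rule; the environment interface (n_states, n_actions, reset(), step()) and the hyperparameter values are illustrative assumptions, not part of the article.

    import numpy as np

    # Minimal sketch of tabular Q-learning. The environment object `env`
    # (n_states, n_actions, reset(), step()) is a hypothetical stand-in.
    def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = np.zeros((env.n_states, env.n_actions))
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy action selection
                if np.random.rand() < epsilon:
                    a = np.random.randint(env.n_actions)
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)
                # TD target r + gamma * max_a' Q(s', a'), taken from the
                # optimal Bellman equation for Q*
                target = r + gamma * np.max(Q[s_next]) * (not done)
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q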

Policy Iteration: It is the process of determining the optimal policy for the model and consists of the following two steps:

  1. Policy Evaluation: This step estimates the value function (the expected long-term reward) under the greedy policy obtained from the last Policy Improvement step.
  2. Policy Improvement: This step updates the policy with the action that maximizes V for each state. The two steps are repeated alternately until the policy converges; a minimal Python sketch of the full procedure is given after the steps below.

Steps involved:

  • Initialization:

    V(s) = any random real number
    \pi(s) = any action in A(s), chosen arbitrarily

  • Policy Evaluation:
    do
    {
        \Delta = 0
        for each s in S
        {
            v = V(s)
            V(s) = \sum_{s',r}(p(s',r|s,\pi (s))(r+\gamma V(s')))
            \Delta = max(\Delta ,|v-V(s)|)
        }
    } while(\Delta > \theta)

    where \theta is a small positive threshold, \theta \rightarrow 0^{+}
    
    
  • Policy Improvement:
    isPolicyStable = true
    for each s in S
    {
        a = \pi (s)
        \pi (s) = argmax_{a}\sum _{s',r}(p(s',r|s,a)(r+\gamma V(s')))
        if(a\neq \pi (s))
            isPolicyStable = false
    }
    if(isPolicyStable == true)
        return V,\pi
    else
        go back to Policy Evaluation
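
The sketch below puts the two steps together in Python. It assumes the transition model is available as a table P, where P[s][a] is a list of (probability, next_state, reward) tuples; this representation of p(s',r|s,a), and the parameter values, are illustrative assumptions.

    import numpy as np

    # Minimal sketch of policy iteration with a known model.
    # P[s][a] = [(prob, next_state, reward), ...] stands in for p(s', r | s, a).
    def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
        V = np.zeros(n_states)
        pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy

        def expected_value(s, a):
            # sum over s', r of p(s', r | s, a) * (r + gamma * V(s'))
            return sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])

        while True:
            # Policy Evaluation: sweep until the largest value change is below theta
            while True:
                delta = 0.0
                for s in range(n_states):
                    v = V[s]
                    V[s] = expected_value(s, pi[s])
                    delta = max(delta, abs(v - V[s]))
                if delta < theta:
                    break

            # Policy Improvement: act greedily with respect to V
            policy_stable = True
            for s in range(n_states):
                old_action = pi[s]
                pi[s] = int(np.argmax([expected_value(s, a) for a in range(n_actions)]))
                if old_action != pi[s]:
                    policy_stable = False
            if policy_stable:
                return V, pi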
    
Value Iteration: This process updates the value function V directly according to the optimal Bellman equation, without maintaining an explicit policy during the sweeps. A minimal Python sketch is given after the working steps below.

    v_{*}(s) = max_{a}E(R_{t+1}+\gamma v_{*}(S_{t+1})|S_{t}=s,A_{t}=a)

Working Steps:

  • Initialization: Initialize the array V with arbitrary random real numbers.
  • Computing the optimal value:
    do
    {
        \Delta = 0
        for each s in S
        {
            v = V(s)
            V(s) = max_{a}\sum_{s',r}(p(s',r|s,a)(r+\gamma V(s')))
            \Delta = max(\Delta ,|v-V(s)|)
        }
    } while(\Delta > \theta)

  • Extracting the policy: once V has converged, the greedy policy is

    \pi (s) = argmax_{a}\sum _{s',r}(p(s',r|s,a)(r+\gamma V(s')))
    return \pi
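
A minimal Python sketch of value iteration, reusing the same hypothetical model representation P[s][a] = [(probability, next_state, reward), ...] assumed in the policy iteration sketch above:

    import numpy as np

    # Minimal sketch of value iteration with a known model.
    # P[s][a] = [(prob, next_state, reward), ...] stands in for p(s', r | s, a).
    def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
        V = np.zeros(n_states)

        def expected_value(s, a):
            # sum over s', r of p(s', r | s, a) * (r + gamma * V(s'))
            return sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])

        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                # Optimal Bellman backup: maximize over actions
                V[s] = max(expected_value(s, a) for a in range(n_actions))
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break

        # Recover the greedy policy from the converged values
        pi = np.array([int(np.argmax([expected_value(s, a) for a in range(n_actions)]))
                       for s in range(n_states)])
        return V, pi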
    

