Q-learning Mathematical Background
  • Last Updated : 18 Jun, 2019

Prerequisites: Q-Learning.

The derivations below use the symbols defined in the prerequisite article.
The Q-learning technique is based on the Bellman Equation.

v(s) = E(R_{t+1}+\lambda v(S_{t+1})|S_{t}=s)
where,
E : Expectation
S_{t+1} : the state at the next time step
\lambda : discount factor

Rewriting the above equation in terms of the Q-value:

Q^{\pi}(s,a) = E(r_{t+1}+\lambda r_{t+2}+\lambda ^{2}r_{t+3}+...|S_{t}=s,A_{t}=a)



= E_{s'}(r_{t+1}+\lambda Q^{\pi}(s',a')|S_{t}=s,A_{t}=a)

The optimal Q-value is given by

Q^{*}(s,a) = E_{s'}(r_{t+1}+\lambda max_{a'}Q^{*}(s',a')|S_{t}=s,A_{t}=a)
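The optimal Q-value equation above is what the Q-learning update rule approximates from samples. As a rough illustrative sketch (the two-state MDP, the learning rate alpha, and the uniform exploration below are hypothetical choices, not from the article):

```python
import random

random.seed(0)

# Hypothetical deterministic MDP, used only for illustration:
# transitions[(state, action)] = (next_state, reward)
transitions = {
    (0, 0): (0, 0.0), (0, 1): (1, 1.0),
    (1, 0): (0, 0.0), (1, 1): (1, 2.0),
}

gamma = 0.9   # discount factor (lambda in the article's notation)
alpha = 0.5   # learning rate
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

s = 0
for _ in range(5000):
    a = random.choice((0, 1))                 # explore uniformly at random
    s_next, r = transitions[(s, a)]
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s_next
```

For this MDP the fixed point of the optimal equation gives Q(1,1) = 2/(1-0.9) = 20, and the learned table converges toward it.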

Policy Iteration: This is the process of determining the optimal policy for the model, and it alternates the following two steps:

  1. Policy Evaluation: This process estimates the value of the long-term reward function with the greedy policy obtained from the last Policy Improvement step.
  2. Policy Improvement: This process updates the policy with the action that maximizes V for each state. This process is repeated until convergence is achieved.

Steps Involved:

  • Initialization:

    V(s) = an arbitrary real number
    \pi(s) = any action in A(s), chosen arbitrarily

  • Policy Evaluation:
    repeat
    {
        \Delta = 0
        for each s in S
        {
            v = V(s)
            V(s) = \sum_{s',r}(p(s',r|s,\pi (s))(r+\lambda V(s')))
            \Delta = max(\Delta ,|v-V(s)|)
        }
    } until \Delta < \theta
    
    where \theta is a small positive convergence threshold (\theta \rightarrow 0^{+})
    
    
  • Policy Improvement:
    isPolicyStable = true
    for each s in S
    {
        a = \pi (s)
        \pi (s) = argmax_{a}\sum _{s',r}(p(s',r|s,a)(r+\lambda V(s')))
        if(a\neq \pi (s))
            isPolicyStable = false
    }
    if(isPolicyStable == true)
        return V,\pi
    else
        go back to Policy Evaluation
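The evaluation and improvement steps above can be sketched in Python. The two-state deterministic MDP below is a hypothetical example chosen for illustration; `P`, `gamma`, and `theta` are assumptions, not from the article:

```python
# Hypothetical deterministic MDP: P[s][a] = (next_state, reward)
P = {
    0: {0: (0, 0.0), 1: (1, 1.0)},
    1: {0: (0, 0.0), 1: (1, 2.0)},
}
gamma = 0.9     # discount factor (lambda in the article)
theta = 1e-8    # small positive convergence threshold

V = {s: 0.0 for s in P}
pi = {s: 0 for s in P}          # arbitrary initial policy

while True:
    # Policy Evaluation: sweep the Bellman expectation backup until stable
    while True:
        delta = 0.0
        for s in P:
            v = V[s]
            s_next, r = P[s][pi[s]]
            V[s] = r + gamma * V[s_next]
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Policy Improvement: act greedily with respect to V
    stable = True
    for s in P:
        old_a = pi[s]
        pi[s] = max(P[s], key=lambda a: P[s][a][1] + gamma * V[P[s][a][0]])
        if old_a != pi[s]:
            stable = False
    if stable:
        break
```

On this MDP the loop terminates with the greedy policy that always takes action 1, with V(1) = 2/(1-0.9) = 20 and V(0) = 1 + 0.9 * 20 = 19.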
    
Value Iteration: This process updates the function V according to the Optimal Bellman Equation.

    v_{*}(s) = max_{a}E(R_{t+1}+\lambda v_{*}(S_{t+1})|S_{t}=s,A_{t}=a)

Working Steps:

  • Initialization: Initialize the array V with arbitrary real numbers.
  • Computing the optimal value:
    repeat
    {
        for each s in S
        {
            v = V(s)
            V(s) = max_{a}\sum_{s',r}(p(s',r|s,a)(r+\lambda V(s')))
            \Delta = max(\Delta ,|v-V(s)|)
        }
    } until \Delta < \theta
    
    \pi (s) = argmax_{a}\sum _{s',r}(p(s',r|s,a)(r+\lambda V(s')))
    return \pi
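These working steps can be sketched in Python as well. The deterministic MDP below is again a hypothetical illustration; note that, unlike policy evaluation, the backup maximizes over actions instead of following a fixed policy:

```python
# Hypothetical deterministic MDP: P[s][a] = (next_state, reward)
P = {
    0: {0: (0, 0.0), 1: (1, 1.0)},
    1: {0: (0, 0.0), 1: (1, 2.0)},
}
gamma = 0.9     # discount factor (lambda in the article)
theta = 1e-8    # small positive convergence threshold

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        v = V[s]
        # Optimal Bellman backup: maximize over actions
        V[s] = max(r + gamma * V[s2] for a, (s2, r) in P[s].items())
        delta = max(delta, abs(v - V[s]))
    if delta < theta:
        break

# Extract the greedy policy from the converged values
pi = {s: max(P[s], key=lambda a: P[s][a][1] + gamma * V[P[s][a][0]])
      for s in P}
```

Value iteration reaches the same fixed point as policy iteration here, V(1) = 20 and V(0) = 19, but folds the max over actions directly into the value update.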
    
