Asynchronous Advantage Actor Critic (A3C) algorithm

The Asynchronous Advantage Actor Critic (A3C) algorithm is a Deep Reinforcement Learning algorithm developed by DeepMind, the Artificial Intelligence research lab owned by Google. It was introduced in 2016 in a research paper appropriately named Asynchronous Methods for Deep Reinforcement Learning.

Decoding the different parts of the algorithm’s name:

  • Asynchronous: Unlike other popular Deep Reinforcement Learning algorithms such as Deep Q-Learning, which use a single agent and a single environment, this algorithm uses multiple agents, each with its own copy of the network parameters and its own copy of the environment. These agents interact with their respective environments asynchronously, learning with each interaction. Each agent periodically synchronizes its parameters with a global network and sends the gradients it has accumulated back to it. As each agent gains more knowledge, it contributes to the total knowledge of the global network. Because the agents explore different parts of the environment at the same time, the global network is trained on more diversified data. This setup mimics the real-life environment in which humans live, as each human gains knowledge from the experiences of other humans, thus allowing the whole “global network” to be better.
  • Actor-Critic: Unlike some simpler techniques which are based on either Value-Iteration methods or Policy-Gradient methods, the A3C algorithm combines the best parts of both, i.e. it predicts both the value function V(s) and the policy function \pi (s). The learning agent uses the output of the value function (Critic) to update the policy function (Actor). Note that here the policy function means the probability distribution over the action space. To be exact, the learning agent determines the conditional probability P(a|s;\theta ), i.e. the parametrized probability that the agent chooses the action a when in state s.

  • Advantage: Typically, in an implementation of Policy Gradient, the value of the Discounted Returns (\gamma r) is used to tell the agent which of its actions were rewarded and which ones were penalized. By using the value of the Advantage instead, the agent also learns how much better the rewards were than its expectation. This gives the agent more insight into the environment and thus improves the learning process. The advantage metric is given by the following expression:

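Advantage: A(s, a) = Q(s, a) – V(s)

A3C does not learn Q(s, a) separately; the discounted return R observed by the agent serves as its estimate, so the advantage is computed as R – V(s).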
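As a rough illustration of the expression above, the following Python sketch estimates the advantage for a short run of steps by using the discounted return as a stand-in for Q(s, a). The function names, the sample rewards, and the sample value estimates are assumptions made for this example; they do not come from the paper.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns R_i = r_i + gamma * R_{i+1}, computed backwards."""
    returns = []
    R = bootstrap_value  # 0.0 when the run ended in a terminal state
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns


def advantages(rewards, values, bootstrap_value, gamma=0.99):
    """A_i = R_i - V(s_i), with R_i used as a sample estimate of Q(s_i, a_i)."""
    returns = n_step_returns(rewards, bootstrap_value, gamma)
    return [R - v for R, v in zip(returns, values)]


# Illustrative numbers only: three steps ending in a terminal state.
rewards = [1.0, 0.0, 2.0]
values = [1.5, 1.0, 1.0]   # the critic's estimates V(s_i)
print(advantages(rewards, values, bootstrap_value=0.0, gamma=0.5))
```

A positive entry means the action led to a better outcome than the critic expected, so its probability should be raised; a negative entry means the opposite.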

The following pseudo-code for a single agent (thread) is adapted from the research paper mentioned above.

Define global shared parameter vectors \theta and \theta _{v}
Define global shared counter T = 0
Define thread specific parameter vectors \theta ' and \theta _{v}'
Define thread step counter t = 1
while(T <= T_{max})
    d\theta = 0
    d\theta _{v} = 0
    \theta ' = \theta
    \theta _{v}' = \theta _{v}
    t_{start} = t
    s = s_{t}
    while(s_{t} is not terminal and t-t_{start} < t_{max})
        Simulate action a_{t} according to \pi (a_{t}|s_{t};\theta ')
        Receive reward r_{t} and next state s_{t+1}
        t = t + 1
        T = T + 1
    if(s_{t} is terminal)
        R = 0
    else
        R = V(s_{t};\theta _{v}')
    for(i = t-1; i >= t_{start}; i--)
        R = r_{i} + \gamma R
        d\theta = d\theta + \nabla _{\theta '}log(\pi (a_{i}|s_{i};\theta '))(R-V(s_{i};\theta _{v}'))
        d\theta _{v} = d\theta _{v} + \frac{\partial ((R-V(s_{i};\theta _{v}'))^{2})}{\partial \theta _{v}'}
    Asynchronously update \theta using d\theta and \theta _{v} using d\theta _{v}


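T_{max} – Maximum number of steps, counted over all threads

t_{max} – Maximum number of steps a thread takes before it updates the global parameters

d\theta – accumulated gradient used to update the global parameter vector \theta

R – Discounted return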

\pi – Policy function

V – Value function

\gamma – discount factor


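Advantages:

  • In the experiments reported in the paper, A3C trained faster than earlier Deep Reinforcement Learning methods such as Deep Q-Learning, while running on a multi-core CPU instead of a GPU.
  • Because the parallel agents explore different parts of the environment, their updates are less correlated. This diversification of knowledge, as explained above, stabilizes training without the need for an experience replay memory.
  • It can be used on discrete as well as continuous action spaces.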
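To make the last point concrete, the sketch below shows one common way an actor can produce actions in each kind of action space: a softmax distribution over a finite set of actions, and a Gaussian distribution for a real-valued action. The function names and the choice of a softplus-transformed standard deviation are assumptions made for this example, not details taken from the article.

```python
import math
import random


def softplus(x):
    """Maps any real number to a positive one."""
    return math.log1p(math.exp(x))


def sample_discrete(logits, rng):
    """Softmax policy: the actor outputs one score (logit) per action."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]


def sample_continuous(mean, pre_std, rng):
    """Gaussian policy: the actor outputs a mean and an unconstrained scale."""
    return rng.gauss(mean, softplus(pre_std))


def gaussian_log_prob(action, mean, pre_std):
    """log pi(a|s) for the Gaussian policy, as needed by the policy gradient."""
    std = softplus(pre_std)
    return (-0.5 * ((action - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


rng = random.Random(0)
print(sample_discrete([0.0, 1.0, -1.0], rng))   # an index in {0, 1, 2}
print(sample_continuous(0.5, 0.0, rng))         # a real-valued action
```

In both cases the update rule is unchanged: the gradient of log \pi (a|s;\theta ) is scaled by the advantage, and only the form of the distribution differs.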
