
A Brief Introduction to Proximal Policy Optimization

Last Updated : 14 Feb, 2022

Proximal Policy Optimization (PPO) is a recent advancement in the field of Reinforcement Learning, which provides an improvement on Trust Region Policy Optimization (TRPO). The algorithm was proposed in 2017 and showed remarkable performance when it was implemented by OpenAI. To understand and appreciate the algorithm, we first need to understand what a policy is.

Note: This post is ideally targeted towards people who have a fairly basic understanding of Reinforcement Learning.

A policy, in Reinforcement Learning terminology, is a mapping from the state space to the action space. It can be thought of as a set of instructions for the RL agent: which action it should take based upon which state of the environment it is currently in. When we talk about evaluating an agent, we generally mean evaluating the policy function to find out how well the agent performs while following the given policy. This is where Policy Gradient methods play a vital role. When an agent is “learning” and doesn’t yet know which actions yield the best result in the corresponding states, it learns by computing policy gradients. The setup works like a neural network: the gradient of the output, i.e. the log-probability of the action taken in that particular state, is taken with respect to the policy parameters, and the policy is updated in the direction indicated by those gradients.
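As a concrete illustration, here is a minimal sketch (an assumption for this article, not part of the original text) of the quantity a policy gradient method computes, using NumPy and a softmax policy over three discrete actions:

```python
import numpy as np

def softmax(logits):
    """Convert raw action scores into a probability distribution."""
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_prob(logits, action):
    """Gradient of log pi(action | state) w.r.t. the logits.

    For a softmax policy this gradient is one_hot(action) - pi;
    policy gradient methods scale it by a return or advantage
    estimate to form the parameter update.
    """
    pi = softmax(logits)
    one_hot = np.zeros_like(pi)
    one_hot[action] = 1.0
    return one_hot - pi

logits = np.array([1.0, 0.5, -0.5])    # hypothetical scores for 3 actions
g = grad_log_prob(logits, action=0)    # pushes probability toward action 0
```

Note that the components of `g` sum to zero: raising the log-probability of the chosen action necessarily lowers the probability of the others.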

While this tried and tested method works well, the major disadvantages of these methods are their hypersensitivity to hyperparameter tuning, such as the choice of step size and learning rate, along with their poor sample efficiency. Unlike supervised learning, which has a relatively reliable route to convergence with comparatively little hyperparameter tuning, reinforcement learning is a lot more complex, with various moving parts that need to be considered. PPO aims to strike a balance between important factors like ease of implementation, ease of tuning, sample complexity, and sample efficiency, while computing an update at each step that improves the objective yet keeps the deviation from the previous policy relatively small. PPO is, in fact, a policy gradient method that learns from online data as well. It simply ensures that the updated policy isn’t too different from the old policy, which keeps training stable. The most common implementation of PPO is via the Actor-Critic model, which uses two deep neural networks: one takes the action (the actor) and the other handles the rewards (the critic). The PPO objective is shown below:

L^{CLIP}(\theta) = \hat{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]

where:

  • θ is the policy parameter
  • Ê_t denotes the empirical expectation over timesteps
  • r_t(θ) denotes the ratio of the action probabilities under the new and old policies respectively (also known as the Importance Sampling ratio)
  • Â_t is the estimated advantage at time t
  • ε is a hyperparameter, usually 0.1 or 0.2
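To make the clipping behavior concrete, here is a small sketch (my own illustration, using NumPy) of the per-timestep clipped surrogate from the equation above:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-timestep clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With a positive advantage, a ratio far above 1 gets its incentive capped:
ratios = np.array([0.5, 1.0, 1.5])
adv = np.array([1.0, 1.0, 1.0])
obj = ppo_clip_objective(ratios, adv)   # -> [0.5, 1.0, 1.2]
```

Averaging this quantity over a batch of timesteps gives L^CLIP(θ); note how the ratio 1.5 is rewarded only as if it were 1.2, removing the incentive to move the policy too far in a single update.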

The following important inferences can be drawn from the PPO equation:

  • It is a policy gradient optimization algorithm: at each step, an existing policy is updated to seek an improvement with respect to its parameters.
  • It ensures that the update is not too large, i.e. that the new policy is not too different from the old policy (it does so by essentially “clipping” the update region to a very narrow range).
  • The advantage function is the difference between the future discounted sum of rewards for a certain state and action, and the value function of that policy for that state.
  • The Importance Sampling ratio, i.e. the ratio of the action probabilities under the new and old policies respectively, is used for the update.
  • ε is a hyperparameter that denotes the limit of the range within which the update is allowed.

This is how the working PPO algorithm looks in its entirety when implemented in Actor-Critic style:

Algorithm: PPO, Actor-Critic implementation

Input: initial policy parameters θ_old
for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
        Run policy π_{θ_old} in the environment for T timesteps
        Compute advantage estimates Â_1, ..., Â_T
    end for
    Optimize the surrogate L with respect to θ, with K epochs and minibatch size M ≤ NT
    θ_old ← θ
end for
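The loop above can be sketched in Python. This is only a skeleton under stated assumptions: the environment rollout is replaced by random stand-in data, and the actual gradient ascent step on the policy parameters is left as a comment, since a full implementation would require an autodiff framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_trajectories(n_actors, horizon):
    """Stand-in for running pi_old in the environment for T timesteps
    per actor; returns fake importance ratios and advantage estimates."""
    n = n_actors * horizon
    ratios = rng.uniform(0.8, 1.2, size=n)
    advantages = rng.normal(size=n)
    return ratios, advantages

def ppo_update(ratios, advantages, epochs=4, minibatch=64, eps=0.2):
    """Optimize the clipped surrogate over K epochs of minibatches (M <= NT)."""
    n = len(ratios)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # fresh shuffle each epoch
        for start in range(0, n, minibatch):
            batch = idx[start:start + minibatch]
            r, a = ratios[batch], advantages[batch]
            surrogate = np.minimum(
                r * a, np.clip(r, 1 - eps, 1 + eps) * a
            ).mean()
            # A real implementation would take a gradient ascent step on
            # `surrogate` w.r.t. the policy parameters theta here,
            # then set theta_old <- theta after the epochs finish.
    return surrogate

ratios, advs = collect_trajectories(n_actors=4, horizon=32)
last_surrogate = ppo_update(ratios, advs)
```

After the K epochs of minibatch optimization, the collected batch is discarded and a new one is gathered under the updated policy, matching the outer loop of the algorithm.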

What we can observe is that small batches of observations, aka “minibatches”, are used for each update and then discarded in order to incorporate a new batch of observations. The updated policy is ε-clipped to a small region so as to not allow huge updates, which might potentially be irrecoverably harmful. In short, PPO behaves exactly like other policy gradient methods in the sense that it also involves calculating output probabilities in the forward pass and calculating the gradients to improve those decisions, or probabilities, in the backward pass. It involves the use of the importance sampling ratio, like its predecessor TRPO. However, it also ensures that the old and new policies remain within a certain proximity of each other (denoted by ε), so that very large updates are not allowed. It has become one of the most widely used policy optimization algorithms in the field of reinforcement learning.
