Understanding Prioritized Experience Replay

Last Updated : 29 Apr, 2024

Prioritized Experience Replay (PER) is an improvement over conventional experience replay in reinforcement learning. It introduces a priority system that samples and replays experiences according to how much they contribute to learning: instead of selecting experiences uniformly at random, PER prioritizes each experience according to the magnitude of its temporal difference (TD) error.

This prioritized sampling lets the model concentrate on important experiences, speeding up learning by highlighting difficult and instructive transitions. In complex and dynamic environments in particular, PER helps stabilize training, increase sample efficiency, and improve the overall performance of reinforcement learning systems.

In this discussion, we’ll delve into Prioritized Experience Replay, covering its benefits and providing illustrative code examples.

What is Prioritized Experience Replay?

Prioritized Experience Replay is a reinforcement learning technique that diverges from uniform random sampling by prioritizing experiences based on their significance. Rather than replaying transitions at random, it focuses on pivotal learning moments, much as a student emphasizes challenging exercises. Experiences are ranked by their impact on the agent’s behavior, measured through the temporal difference (TD) error: transitions with unexpected outcomes or inaccurate predictions receive higher priority, which helps the agent correct its mistakes and learn more effectively. Mathematically, priority is proportional to the magnitude of the error, with a small adjustment so that no experience ever has a zero chance of being replayed.
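
For intuition, the temporal difference error that drives this prioritization is usually computed from the agent’s own value estimates, for example δ = r + γ · max Q(s′, a′) − Q(s, a) in Q-learning. The snippet below is a minimal sketch using made-up Q-value arrays, purely for illustration:

Python

import numpy as np

# Hypothetical Q-value estimates (illustrative numbers, not from a trained network)
q_current = np.array([0.40, 0.55])   # Q(s, a) for each action in the current state
q_next = np.array([0.60, 0.35])      # Q(s', a') for each action in the next state

action, reward, gamma = 1, 1.0, 0.99

# TD error: how far the observed outcome deviates from the current estimate
td_error = reward + gamma * np.max(q_next) - q_current[action]
print(td_error)  # 1.0 + 0.99 * 0.60 - 0.55 = 1.044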

Mathematical Concepts

In Prioritized Experience Replay (PER), mathematical reasoning guides the AI to focus on key past experiences for optimal learning. Similar to selecting unique jelly bean flavors, PER assigns a ‘priority score’ based on learning potential. Instances that surprise the AI, deviating from predictions, receive higher scores, quantified by the ‘temporal difference error.’ This system enables the AI to efficiently prioritize experiences, streamlining the learning process by revisiting those most likely to enhance decision-making.

Formula

P_i = \left|\text{TD error}_i\right| + \epsilon

Here, ( P_i) is the priority of the i-th experience, with ( \text{TD error}_i ) as the associated temporal difference error. The absolute value accommodates negative errors, and ( \epsilon ) is a small positive constant preventing experiences from having zero replay chances. The priority dictates the probability of selecting an experience for replay during learning.
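
As a concrete illustration of this formula, the short snippet below converts a few hypothetical TD errors into priorities and normalized sampling probabilities (the numbers are made up):

Python

import numpy as np

td_errors = np.array([0.9, -0.1, 0.0, 0.4])    # hypothetical TD errors
epsilon = 0.01                                  # small constant so no priority is zero

priorities = np.abs(td_errors) + epsilon        # P_i = |TD error_i| + epsilon
probabilities = priorities / priorities.sum()   # normalize into a sampling distribution

print(priorities)     # [0.91 0.11 0.01 0.41]
print(probabilities)  # approximately [0.632 0.076 0.007 0.285]

Experiences with larger absolute TD errors end up with proportionally higher chances of being replayed, while the epsilon term keeps even zero-error experiences in play.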

How Prioritized Experience Replay Works

Prioritized Experience Replay (PER) in artificial intelligence is akin to a diligent student strategically focusing on impactful learning. It introduces a systematic approach, akin to a fluorescent marker highlighting crucial experiences for scrutiny. Notable experiences undergo thorough examination, guiding the AI’s learning journey with deliberate choices based on educational value. Think of PER as an advanced study technique where the AI selectively revisits challenging material, optimizing its performance, much like a student prioritizing difficult subjects.

Prioritized Experience Replay Example:

In this example, the code below sets up a replay buffer for reinforcement learning using the CartPole environment from OpenAI Gym. It initializes a buffer with a maximum size and generates experiences by taking random actions in the environment. Each experience, consisting of the state, action, reward, next state, done flag, and a simulated temporal difference error, is stored in the buffer. When a batch is sampled for training, the buffer prioritizes experiences according to their absolute temporal difference error, and the sampled experiences are then printed. This implementation follows a prioritized experience replay strategy to enhance the learning process.

Python

import numpy as np
import gym
  
# Suppress gym warnings
gym.logger.set_level(40)
  
# Create the environment with the new step API
env = gym.make('CartPole-v1')
  
  
class ReplayBuffer:
    def __init__(self, size):
        self.size = size
        self.buffer = []
  
    def add(self, experience):
        if len(self.buffer) >= self.size:
            self.buffer.pop(0)
        self.buffer.append(experience)
  
    def sample(self, batch_size):
        # experience[5] holds the simulated temporal difference error (see the tuple layout below)
        priorities = np.array([abs(experience[5]) for experience in self.buffer])
        probabilities = priorities / priorities.sum()
        sample_indices = np.random.choice(
            range(len(self.buffer)), size=batch_size, p=probabilities)
        return [self.buffer[i] for i in sample_indices]
  
  
# Initialize replay buffer with a maximum size
replay_buffer = ReplayBuffer(1000)
  
# Simulated experience: [state, action, reward, next_state, done, td_error]
for _ in range(100):
    state, _ = env.reset()  # the new gym API returns (observation, info)
    done = truncated = False
    while not done and not truncated:
        action = env.action_space.sample()  # Replace with your agent's action
        next_state, reward, done, truncated, _ = env.step(action)
        # Simulated temporal difference error (replace with actual calculation)
        td_error = np.random.random()
        replay_buffer.add([state, action, reward, next_state, done, td_error])
        state = next_state
  
# Sample from the replay buffer
batch_size = 32
experiences = replay_buffer.sample(batch_size)
  
# Print the sampled experiences
for state, action, reward, next_state, done, td_error in experiences:
    print(f"""\nState :{state},
    Action    :{action},
    Reward    :{reward},
    Next_state:{next_state},
    Done      :{done},
    TD Error  :{td_error}""")
    print('*'*50)

                    

Output:


State :[ 0.11591834 0.77425534 -0.14344667 -1.2312824 ],
Action :0,
Reward :1.0,
Next_state:[ 0.13140345 0.5812397 -0.16807233 -0.9867614 ],
Done :False,
TD Error :0.7477144260668369
**************************************************

State :[ 0.01208125 -0.02053603  0.0128395  -0.02046756],
Action :1,
Reward :1.0,
Next_state:[ 0.01167053 0.17439947 0.01243015 -0.309072 ],
Done :False,
TD Error :0.46852288691261335
**************************************************

State :[ 0.02144278 -0.3443566 -0.00629879 0.4912028 ],
Action :1,
Reward :1.0,
Next_state:[ 0.01455565 -0.14914636 0.00352527 0.19654144],
Done :False,
TD Error :0.7740682467164754
**************************************************

State :[-0.06698377 -0.2678352 0.19415861 0.88250273],
Action :1,
Reward :1.0,
Next_state:[-0.07234047 -0.07580476 0.21180867 0.65659404],
Done :True,
TD Error :0.11496709236515668
**************************************************

State :[ 0.14833789 1.1753476 -0.20789757 -1.8135685 ],
Action :1,
Reward :1.0,
Next_state:[ 0.17184484 1.3720902 -0.24416894 -2.1630104 ],
Done :True,
TD Error :0.9478533953446938
**************************************************

State :[-0.04076329 -0.18081008 0.0930106 0.55591196],
Action :1,
Reward :1.0,
Next_state:[-0.0443795 0.01289139 0.10412884 0.29392135],
Done :False,
TD Error :0.7132875819667918
**************************************************

State :[-0.02822679 -0.17295691 0.04501294 0.3808615 ],
Action :1,
Reward :1.0,
Next_state:[-0.03168593 0.02149792 0.05263017 0.10270403],
Done :False,
TD Error :0.7377047532579617
**************************************************

State :[-0.09415992 -0.16580078 0.17181695 0.5425806 ],
Action :0,
Reward :1.0,
Next_state:[-0.09747594 -0.362868 0.18266855 0.884095 ],
Done :False,
TD Error :0.1993198955724116
**************************************************

State :[ 0.01934518 0.23324965 0.03319921 -0.28958732],
Action :1,
Reward :1.0,
Next_state:[ 0.02401018 0.42788285 0.02740747 -0.5716174 ],
Done :False,
TD Error :0.7493026368547009
**************************************************

State :[-0.33616763 -0.6907772 -0.02843124 -0.0226175 ],
Action :1,
Reward :1.0,
Next_state:[-0.3499832 -0.49525928 -0.02888359 -0.3241335 ],
Done :False,
TD Error :0.788057872928113
**************************************************

State :[ 0.00532935 0.39540118 -0.03692627 -0.6400925 ],
Action :1,
Reward :1.0,
Next_state:[ 0.01323737 0.59101796 -0.04972812 -0.9441715 ],
Done :False,
TD Error :0.8672773210927185
**************************************************

State :[-0.03596839 -0.1296449 -0.06337826 -0.36678526],
Action :0,
Reward :1.0,
Next_state:[-0.03856128 -0.32381168 -0.07071396 -0.09474061],
Done :False,
TD Error :0.8288382931813617
**************************************************

State :[-0.01380879 -0.6352626 0.06954668 0.95111096],
Action :1,
Reward :1.0,
Next_state:[-0.02651404 -0.4411421 0.0885689 0.68106437],
Done :False,
TD Error :0.1832674032881244
**************************************************

State :[ 0.02991444 0.00881486 -0.14392613 -0.3922633 ],
Action :1,
Reward :1.0,
Next_state:[ 0.03009074 0.2056547 -0.1517714 -0.72663856],
Done :False,
TD Error :0.8001455596540854
**************************************************

State :[-0.00633797 0.00784681 0.02946823 -0.04063726],
Action :0,
Reward :1.0,
Next_state:[-0.00618103 -0.18768504 0.02865549 0.2611956 ],
Done :False,
TD Error :0.20537150374795687
**************************************************

State :[-0.13672931 -0.38212386 0.0118483 0.23635693],
Action :1,
Reward :1.0,
Next_state:[-0.1443718 -0.18717316 0.01657544 -0.05256526],
Done :False,
TD Error :0.826033385795174
**************************************************

State :[ 0.0200416 -0.24002768 -0.02207134 0.30453306],
Action :0,
Reward :1.0,
Next_state:[ 0.01524105 -0.43482825 -0.01598068 0.5901743 ],
Done :False,
TD Error :0.9848919694358678
**************************************************

State :[-0.01162592 0.15608896 -0.02377166 -0.3539639 ],
Action :1,
Reward :1.0,
Next_state:[-0.00850414 0.3515407 -0.03085094 -0.65404695],
Done :False,
TD Error :0.05813761546070639
**************************************************

State :[ 0.04533144 0.7859577 -0.05289783 -1.1303552 ],
Action :0,
Reward :1.0,
Next_state:[ 0.0610506 0.5915668 -0.07550494 -0.8547215 ],
Done :False,
TD Error :0.2546316813400499
**************************************************

State :[-0.07075828 -0.94547755 0.16721295 1.7364645 ],
Action :0,
Reward :1.0,
Next_state:[-0.08966783 -1.142065 0.20194224 2.0761647 ],
Done :False,
TD Error :0.8990279149950086
**************************************************

State :[-0.06678355 -0.06583953 0.17092654 0.4289005 ],
Action :1,
Reward :1.0,
Next_state:[-0.06810035 0.12650189 0.17950454 0.19460076],
Done :False,
TD Error :0.24030111070520677
**************************************************

State :[ 0.04645281 0.40705928 -0.20409918 -1.1720896 ],
Action :0,
Reward :1.0,
Next_state:[ 0.05459399 0.21508919 -0.22754097 -0.94970065],
Done :True,
TD Error :0.21294479960345125
**************************************************

State :[-0.04055458 0.39916432 -0.1256617 -1.1495962 ],
Action :1,
Reward :1.0,
Next_state:[-0.0325713 0.5956821 -0.14865363 -1.4788959 ],
Done :False,
TD Error :0.03649163471194816
**************************************************

State :[ 0.00468635 -0.5395449 0.01726278 0.7853578 ],
Action :1,
Reward :1.0,
Next_state:[-0.00610455 -0.34466434 0.03296994 0.4981555 ],
Done :False,
TD Error :0.7528579190537328
**************************************************

State :[-0.12131985 -0.06166982 0.19501229 0.3884462 ],
Action :1,
Reward :1.0,
Next_state:[-0.12255324 0.13022701 0.20278122 0.1630279 ],
Done :False,
TD Error :0.1023526706124891
**************************************************

State :[-0.0721189 0.16941373 -0.00794339 -0.53338546],
Action :1,
Reward :1.0,
Next_state:[-0.06873062 0.3646465 -0.0186111 -0.8285607 ],
Done :False,
TD Error :0.4646957392043991
**************************************************

State :[ 0.00639658 -0.19945058 -0.06349543 0.19460389],
Action :1,
Reward :1.0,
Next_state:[ 0.00240757 -0.00348054 -0.05960335 -0.11741393],
Done :False,
TD Error :0.20289915014168036
**************************************************

State :[-0.0599728 -0.56426334 0.10770321 1.1660558 ],
Action :1,
Reward :1.0,
Next_state:[-0.07125807 -0.3706952 0.13102433 0.9089894 ],
Done :False,
TD Error :0.8860488886451029
**************************************************

State :[-0.00165101 0.19416445 -0.06194754 -0.46632877],
Action :1,
Reward :1.0,
Next_state:[ 0.00223228 0.39010447 -0.07127412 -0.77787596],
Done :False,
TD Error :0.732040887014864
**************************************************

State :[ 0.00930487 0.04312118 -0.00657871 -0.06197907],
Action :0,
Reward :1.0,
Next_state:[ 0.0101673 -0.15190583 -0.0078183 0.22862099],
Done :False,
TD Error :0.28260998921244374
**************************************************

State :[-0.03688454 -0.43363672 0.00842516 0.49306697],
Action :0,
Reward :1.0,
Next_state:[-0.04555727 -0.6288765 0.0182865 0.7883932 ],
Done :False,
TD Error :0.3715264709799807
**************************************************

State :[-0.01834441 -1.1446345 0.18673103 1.8109102 ],
Action :1,
Reward :1.0,
Next_state:[-0.0412371 -0.9520205 0.22294922 1.5815922 ],
Done :True,
TD Error :0.97482473141418
**************************************************

Output Explanation:

The output of the above code consists of 32 sampled experience tuples from a simulated reinforcement learning scenario using the CartPole environment and a replay buffer. Each tuple includes information about the current state, action taken, received reward, subsequent state, a boolean indicating if the episode is finished, and a simulated temporal difference error. The content of the experiences varies due to random actions and temporal difference errors, and the sampling prioritizes experiences with higher temporal difference errors, reflecting a prioritized experience replay strategy.
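
Note that the TD error in the example above is simulated with np.random.random(). In a real agent it would come from the agent’s value estimates; a minimal sketch, assuming a hypothetical q_network callable that maps a state to per-action Q-values, might look like this:

Python

import numpy as np

def compute_td_error(q_network, state, action, reward, next_state, done, gamma=0.99):
    """Hypothetical helper: TD error from a Q-network's estimates."""
    q_values = q_network(state)            # Q(s, .) for every action
    next_q_values = q_network(next_state)  # Q(s', .) for every action
    target = reward + (1.0 - float(done)) * gamma * np.max(next_q_values)
    return target - q_values[action]

The value returned by such a helper would replace the simulated np.random.random() call when the experience is added to the buffer.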

Advantages of Prioritized Experience Replay

Prioritized Experience Replay (PER) offers several advantages in reinforcement learning:

  1. Efficient Learning: PER prioritizes experiences based on their significance, allowing the agent to focus on the most informative and impactful events. This targeted approach enhances learning efficiency by emphasizing experiences with higher learning potential.
  2. Improved Sample Efficiency: By prioritizing experiences with higher temporal difference errors, PER directs the learning algorithm to focus on instances where the agent’s predictions deviate significantly from reality. This targeted sampling improves sample efficiency, enabling the agent to learn more effectively from critical events.
  3. Enhanced Generalization: Prioritizing experiences based on their importance promotes better generalization. The agent learns to adapt more effectively to a broader range of scenarios, as it concentrates on experiences that challenge and refine its decision-making abilities, leading to improved overall performance.
  4. Faster Convergence: The selective replay of prioritized experiences accelerates the learning process by emphasizing crucial moments that contribute the most to the agent’s knowledge. This results in faster convergence towards an optimal policy, reducing the time required for the learning algorithm to achieve high-performance levels.
  5. Adaptability to Task Complexity: PER allows the agent to adapt to the complexity of the learning task by focusing on experiences that are more challenging or unexpected. This adaptability is particularly beneficial in environments with dynamic and varying conditions, as the agent learns to prioritize experiences that contribute most effectively to its evolving understanding of the task.

Conclusion

In summary, Prioritized Experience Replay (PER) significantly improves reinforcement learning by prioritizing informative experiences. This targeted approach enhances sample efficiency, accelerates convergence, and promotes better generalization. PER’s adaptability to task complexity ensures effective learning in dynamic environments. Overall, it stands as a powerful tool, optimizing the learning process and advancing the capabilities of reinforcement learning algorithms.


