# Sparse Rewards in Reinforcement Learning

• Last Updated : 21 Apr, 2022

Prerequisite: Understanding Reinforcement Learning in-depth

In the previous articles, we learned about reinforcement learning, the general paradigm, and the issues with sparse reward settings. In this article, we'll dive a little further into some more technical work aimed at solving the sparse reward problem. Because we're dealing with sparse rewards, we don't know the target label that our network should produce for each input frame, so our agent must learn from very sparse feedback and figure out on its own which action sequences led to the final reward. One option that has emerged from the research is to supplement the sparse extrinsic reward signal received from the environment with additional dense reward signals to enhance the agent's learning. We'll go over some fascinating technical work that introduces concepts such as auxiliary reward signals, curiosity-driven exploration, and hindsight experience replay.

In order to overcome some of the most difficult challenges in reinforcement learning, a wide range of novel ideas in reinforcement learning research have emerged. One recent trend has been to supplement the in-game environment’s sparse extrinsic reward signal with additional feedback signals that aid your agent’s learning. Many of these new concepts are variants of the same core theme.

Instead of a sparse reward signal that our agent only sees on rare occasions, we want to construct extra feedback signals that are very rich; in other words, we want to create something closer to a supervised setting. The point of those extra rewards and additional feedback signals is that they are tied in some way to the task we want our agent to do. We want to produce these dense feedback signals so that whenever our agent completes those auxiliary tasks, it will likely gain information or feature extractors that will be useful for the final, sparse-reward task we are actually interested in. It is impossible to cover all the methodologies in depth in a single article, but this article will sketch a few highly intriguing papers to give you an idea of the main directions research is now taking.

## Auxiliary Losses:

Architecture of Reinforcement Learning Agent with Unsupervised Auxiliary Task

In the majority of reinforcement learning settings, our agent is given some kind of unprocessed input data, such as image sequences. The agent then uses some sort of feature extraction pipeline to extract relevant information from those raw input images, and a policy network uses the extracted features to perform the task we want it to learn.

Our feedback signal can be so sparse in reinforcement learning that the agent never succeeds in extracting relevant features from the input frames. In this scenario, a successful method is to give our agent additional learning objectives that exploit the strengths of supervised learning to produce highly valuable feature extractors on those images. Let's go through Google DeepMind's paper, "Reinforcement Learning with Unsupervised Auxiliary Tasks". The basic sparse reward signal is that our agent walks around a 3D maze looking for specified objects, and it receives a reward whenever it encounters one of those objects. Rather than relying on this relatively sparse feedback signal alone, the authors supplement the entire training process with three additional reward signals. The first is what is known as pixel control: using the main feature extraction pipeline, the agent learns a separate policy that, given a frame from the environment, maximally changes the pixel intensities in particular regions of the input image. It may, for example, learn that looking up at the sky changes nearly all of the pixel values in the input. In their proposed implementation:

• For pixel control, the input frame is divided into a small number of grid cells, with each cell receiving a visual-change score. The pixel-control policy is then trained to maximize the total visual change across all cells, with the goal of forcing the feature extractor to become more sensitive to the game's overall dynamics.
• The second auxiliary task is Reward Prediction: given the three most recent frames from the episode sequence, the agent must predict the reward that will be provided on the next step. This adds another learning objective that optimizes the feature extraction pipeline in a way that should be generally useful for the end goal we care about.
• The third task is Value Function Replay, which estimates the value of the current state by predicting the total future reward the agent will receive from this moment onward. This is essentially what every off-policy algorithm, such as DQN, does all the time. It turns out that adding these relatively simple extra objectives to the training pipeline dramatically improves the learning agent's sample efficiency.

In three-dimensional environments, the addition of the pixel-control task appears to work particularly well: learning to control your gaze direction, and how this affects your own visual input, is critical for learning any form of effective behaviour.
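As a concrete sketch of the pixel-control target, the snippet below (a toy illustration, not the paper's implementation; the grid size and the mean-absolute-change score are assumptions) computes a per-cell visual-change score from two consecutive frames:

```python
import numpy as np

def pixel_change_scores(frame_t, frame_t1, grid=4):
    """Split the absolute frame difference into a grid x grid set of
    cells and return the mean intensity change per cell. These scores
    are the kind of target a pixel-control head learns to maximise."""
    diff = np.abs(frame_t1.astype(np.float32) - frame_t.astype(np.float32))
    h, w = diff.shape[:2]
    ch, cw = h // grid, w // grid
    scores = np.zeros((grid, grid), dtype=np.float32)
    for i in range(grid):
        for j in range(grid):
            cell = diff[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            scores[i, j] = cell.mean()
    return scores
```

An auxiliary policy that acts to maximise the sum of these scores is rewarded for actions (such as turning the camera) that visibly change the scene, which in turn shapes the shared feature extractor.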

## Hindsight Experience Replay:

Hindsight Experience Replay (HER) is a fantastic addition to the usual reward system. HER is based on a very simple concept that is quite successful. Consider the following scenario: you wish to teach a robotic arm to push an object on a table to a specified spot. The difficulty is that if you rely on random exploration, you're very unlikely to obtain many rewards, making it very difficult to train this policy. The usual workaround is to provide a dense reward, for example the object's Euclidean distance from the target point. You then get a very specific dense reward for every frame, which you can train on using standard gradient descent. The difficulty is that, as we've seen, reward shaping isn't the best answer; instead, we'd prefer to work with a simple sparse reward, such as success or failure. The general notion behind hindsight experience replay is that we want to learn from every episode, even those that failed to accomplish the task we wanted. To do this, HER employs a deceptively simple trick that lets the agent learn from a failed episode.
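To make the distinction concrete, here is a minimal sketch of the two reward styles for the pushing task (the 0/-1 success convention matches the HER paper; the function names and the tolerance value are ours):

```python
import numpy as np

def sparse_reward(obj_pos, goal_pos, tol=0.05):
    """Binary success signal: 0 on success, -1 otherwise
    (the sparse convention used in the HER paper)."""
    return 0.0 if np.linalg.norm(obj_pos - goal_pos) < tol else -1.0

def dense_reward(obj_pos, goal_pos):
    """Shaped reward: negative Euclidean distance to the goal,
    giving informative feedback on every single frame."""
    return -float(np.linalg.norm(obj_pos - goal_pos))
```

Under random exploration, `sparse_reward` is almost always -1, so the gradient signal is nearly useless; `dense_reward` always points toward the goal, but it bakes a hand-designed shaping assumption into the task.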

The agent begins by pushing an object around on the table, attempting to reach position A, but because the policy isn't very good yet, the object ends up at position B, which is incorrect. Instead of simply telling the model it did something wrong and giving it a reward of 0, HER acts as if reaching position B was what you wanted all along, and tells it, "Yes, very well done, this is how you move the object to position B." You're essentially turning a sparse reward setting into a dense reward setting.

Start with a standard off-policy reinforcement learning algorithm and a method for sampling goal positions. Given a certain goal position, we use our current policy to generate a trajectory and record the final position where the object ended up. Once the episode has concluded, we save all of those transitions in the replay buffer together with the goal that was chosen for the policy. Then we sample a set of additional goals, swap them into the stored state transitions, and save everything in the replay buffer as well. The great thing about this algorithm is that once you've trained it, you have a policy network that can do different things depending on the goal you give it. If you wish to move the object to a different place, you don't have to retrain the entire policy; simply change the goal vector and your policy will adjust accordingly. In the paper's results, the blue curve shows the outcome of hindsight experience replay when the additional sampled goal is always the final state of the episode sequence, i.e. the actual position where the object ended up after the sequence of actions. The red curve shows even better results, obtained when the additional goals are sampled from future states encountered on the same trajectory. The concept is simple and the algorithm is easy to implement, but it addresses a basic difficulty in learning: we want to make the most of every experience we have.

Algorithm from HER Research paper:

Given:

• an off-policy RL algorithm $\mathbb{A}$,
• a strategy $\mathbb{S}$ for sampling goals for replay,
• a reward function $r: \mathcal{S} \times \mathcal{A} \times \mathcal{G} \rightarrow \mathbb{R}$.

Initialize $\mathbb{A}$

Initialize replay buffer $R$

for episode = 1, M do

Sample a goal $g$ and an initial state $s_0$.

for $t = 0, T-1$ do

Sample an action $a_t$ using the behavioral policy from $\mathbb{A}$: $a_t \leftarrow \pi_b(s_t \| g)$

Execute the action $a_t$ and observe a new state $s_{t+1}$

end for

for $t = 0, T-1$ do

$r_t := r(s_t, a_t, g)$

Store the transition $(s_t \| g, a_t, r_t, s_{t+1} \| g)$ in $R$

Sample a set of additional goals for replay $G := \mathbb{S}(\text{current episode})$

for $g' \in G$ do

$r' := r(s_t, a_t, g')$

Store the transition $(s_t \| g', a_t, r', s_{t+1} \| g')$ in $R$

end for

end for

for $t = 1, N$ do

Sample a minibatch $B$ from the replay buffer $R$

Perform one step of optimization using $\mathbb{A}$ and minibatch $B$

end for

end for
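The relabelling at the core of this algorithm can be sketched in a few lines of Python (a toy illustration: the tuple layout, `reward_fn`, and the "future" sampling of `k` hindsight goals are simplified assumptions, not the authors' code):

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """Hindsight relabelling with the 'future' strategy: for every
    transition, also store k copies whose goal is replaced by a state
    actually reached later in the same episode.
    `episode` is a list of (state, action, next_state, goal) tuples;
    stored transitions are (state, goal, action, reward, next_state)."""
    buffer = []
    T = len(episode)
    for t, (s, a, s_next, g) in enumerate(episode):
        # original transition with the real goal
        buffer.append((s, g, a, reward_fn(s_next, g), s_next))
        # k additional transitions with hindsight goals
        for _ in range(k):
            future = random.randint(t, T - 1)
            g_new = episode[future][2]  # achieved state becomes the goal
            buffer.append((s, g_new, a, reward_fn(s_next, g_new), s_next))
    return buffer
```

Even when the real goal is never reached, the relabelled copies contain successful transitions, so the off-policy learner always sees informative rewards.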

## Curiosity Driven Exploration:

The main notion is that you want to somehow encourage your agent to learn about new things it encounters in its environment. Most default reinforcement learning algorithms use epsilon-greedy exploration. This means that in the vast majority of cases, your agent chooses the best available action according to its current policy, but with a small probability epsilon it takes a random action instead. At the start of training, epsilon is 100%, which means the agent acts completely randomly, and as training progresses the epsilon value declines until, at the end, the agent fully follows its policy. The idea is that through these random actions, your agent will learn to explore its surroundings.
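The scheme described above can be sketched as follows (the start, end, and decay values are illustrative defaults, not from any particular paper):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise
    pick the greedy action under the current Q estimates."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from fully random to mostly greedy."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

In hard-exploration environments this undirected randomness is exactly what fails: the random actions rarely string together into the long, precise sequences needed to reach distant states.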

Now, the general idea behind curiosity-driven exploration is that in many cases an agent can quickly learn a very simple behaviour that earns a recurring low amount of reward, but if the environment is difficult to explore, a simple agent using epsilon-greedy exploration will never fully explore the environment in search of better policies. The goal is therefore to produce a new reward signal that encourages the agent to explore previously unexplored areas of the state space. A common way to do this in reinforcement learning is to learn a forward model. This means that when your agent sees a given input frame, it employs a feature extractor to encode the input data into some form of latent representation.

Then there's a forward model, which attempts to predict the latent representation of the following frame in the environment. The premise is that if your agent is in a part of the state space it has visited many times before, these predictions will be quite accurate. However, if it is confronted with a completely new situation that it has never encountered before, its forward model is unlikely to be accurate. The notion is that, in addition to the sparse rewards, you can use these prediction errors as an extra feedback signal to encourage your agent to explore previously unexplored regions of the state space. In one publication, researchers introduced an intrinsic curiosity module and used a great example to demonstrate what this all means. Consider the following scenario: an agent is watching the movement of tree leaves in a breeze. Predicting the pixel changes for each leaf is nearly impossible due to the difficulty of accurately modelling the breeze.

This means that the pixel-space prediction error will remain high forever, and the agent will stay fascinated by the leaves. The underlying issue is that the agent is unaware that some aspects of the environment are simply beyond its ability to control or predict.

Intrinsic Curiosity Module

The image above shows how the intrinsic curiosity module from the study processes the raw environment. A single shared encoder network encodes states s and s+1 into feature space. On top of this sit two models: a forward model, which tries to predict the feature encoding of the next state from the current features and the policy's chosen action, and an inverse model, which tries to predict what action was taken to get from state s to state s+1. Finally, the actual feature encoding of s+1 is compared to the forward model's predicted feature encoding of s+1. The difference, which we may call the agent's surprise at what happened, is added to the reward signal for the agent's training. Returning to our tree-leaf example: since the motion of the leaves cannot be influenced by the agent's actions, the feature encoder has no incentive to model their behaviour, because in the inverse model those features will never be useful for predicting the agent's action. As a result, the features generated by our extraction pipeline ignore irrelevant parts of the surroundings, giving us a much more robust exploration strategy. In the study, they benchmark the strategy with a maze exploration task: you have a complex maze with a goal position, and your agent must navigate the maze to reach that goal.
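A toy numerical sketch of the forward-model half of this idea follows. It uses simple linear models and a fixed random encoder (in the paper, both the encoder and an inverse model are neural networks trained jointly), purely to show how the "surprise" signal shrinks as a transition becomes familiar:

```python
import numpy as np

rng = np.random.default_rng(0)

class CuriosityModule:
    """Toy curiosity sketch: intrinsic reward is the forward model's
    prediction error in feature space. All dimensions and the linear
    models are illustrative assumptions, not the paper's architecture."""

    def __init__(self, obs_dim, feat_dim, n_actions, lr=0.1):
        # Fixed random encoder here; in the paper it is trained
        # jointly via the inverse-model loss so it ignores
        # uncontrollable parts of the observation.
        self.encoder = rng.normal(size=(obs_dim, feat_dim)) * 0.1
        self.forward_w = np.zeros((feat_dim + n_actions, feat_dim))
        self.n_actions = n_actions
        self.lr = lr

    def encode(self, obs):
        return obs @ self.encoder

    def intrinsic_reward(self, obs, action, next_obs):
        phi, phi_next = self.encode(obs), self.encode(next_obs)
        x = np.concatenate([phi, np.eye(self.n_actions)[action]])
        pred = x @ self.forward_w          # forward-model prediction
        err = pred - phi_next
        # one SGD step on the squared prediction error
        self.forward_w -= self.lr * np.outer(x, err)
        return 0.5 * float(err @ err)      # "surprise" = intrinsic reward
```

The first visit to a transition yields a large bonus; repeated visits drive the forward model's error, and hence the bonus, toward zero, pushing the agent on to novel states.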

When the maze's size and complexity grow too large, all methods that lack intrinsic exploration start to fail. However, if you encourage your agent to explore uncharted territory, it is much more likely to eventually find the goal. This is a really nice notion: instead of thinking only about how to reach the goal position or receive the reward, your agent should be curious about the world and explore things it doesn't know about in order to broaden its understanding of the environment. Many researchers are finding that getting your agent to explore its environment effectively and naturally is a critical aspect of learning.

In conclusion, we've seen a few very different approaches to augmenting sparse reward signals with dense feedback, which I believe hints at some of the first steps toward truly unsupervised learning. Despite the impressive results, there are still many really difficult problems in reinforcement learning, so things like generalization, transfer learning, and learning causality and intuitive physics remain as difficult as ever.

If you think about autonomous robotic assistants in your home, such things may still seem like science fiction today. However, I believe we are now tackling some of the most fundamental problems in autonomous learning, just as we once did in supervised learning, where people came up with algorithms like backpropagation and approaches like convolutions. Looking at the incredible pace of progress over the last few years, and the sheer quantity of intellectual talent working on these challenges, I believe breakthroughs could happen surprisingly quickly. The darker side of this very exciting research is that many of the jobs we have today will be subject to a high degree of automation, an inevitable transition that will create a lot of social pressure and inequality; making sure everyone benefits from advances in artificial intelligence is probably one of the most difficult challenges ahead.
