# Introduction to Thompson Sampling | Reinforcement Learning

Reinforcement Learning is a branch of Machine Learning in which an agent learns by interacting with an environment. It is sometimes described as online learning, because decisions are made as the data arrives: the agent decides what action to take at time t+1 based on data up to time t. This concept is used in Artificial Intelligence applications such as learning to walk. A popular example of reinforcement learning is a chess engine. Here, the agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as a win or loss at the end of the game.

Thompson Sampling (also called Posterior Sampling or Probability Matching) is an algorithm for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. Trying actions repeatedly to learn how well they perform is called exploration. The algorithm uses training information that evaluates the actions taken rather than instructing by giving the correct actions. This is what creates the need for active exploration: an explicit trial-and-error search for good behavior. Based on the result of each action, the machine receives a reward (1) or a penalty (0). Further actions are then chosen to maximize the reward, which improves future performance.

Suppose a robot has to pick up several cans and put them in a container. Each time it puts a can into the container, it memorizes the steps it followed and trains itself to perform the task with better speed and precision (a reward). If the robot fails to put the can in the container, it does not memorize that procedure (so speed and performance do not improve), and the attempt is counted as a penalty.

Thompson Sampling has the advantage that it tends to explore less as it gathers more information, which mimics the desirable trade-off in the problem: we want as much information as possible from as few searches as possible. Hence, the algorithm is more “search-oriented” when we have little data and less “search-oriented” when we have a lot of data.
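
One way to see why the search narrows with more data is to look at how uncertain the estimate of a single action's success rate is. The sketch below assumes rewards of 0 or 1 and a uniform Beta(1, 1) prior, neither of which is stated above; the function name `beta_posterior_std` is made up for illustration. It computes the standard deviation of the resulting Beta posterior for the same observed success rate at three sample sizes.

```python
import math


def beta_posterior_std(successes, failures):
    """Standard deviation of a Beta(successes + 1, failures + 1) posterior.

    Assumes 0/1 rewards and a uniform Beta(1, 1) prior (an assumption
    made for this illustration).
    """
    a = successes + 1
    b = failures + 1
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return math.sqrt(variance)


# Same observed success rate (30%), increasing amounts of data.
for successes, failures in [(3, 7), (30, 70), (300, 700)]:
    trials = successes + failures
    print(trials, round(beta_posterior_std(successes, failures), 4))
```

The printed spread shrinks as the number of trials grows, so random draws from the posterior cluster ever more tightly around the observed rate. An algorithm that acts on such draws will therefore try clearly worse actions less and less often.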

Multi-Armed Bandit Problem
A multi-armed bandit is analogous to a slot machine with many arms. Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot. Through repeated action selections you are to maximize your winnings by concentrating your actions on the best levers. Each arm pays out a random reward drawn from a probability distribution specific to that arm, with its own mean reward. Without knowing these probabilities, the gambler has to maximize the sum of rewards earned through a sequence of arm pulls. If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call this a greedy action.

An analogy to this problem is choosing which advertisement to display whenever a user visits a webpage. The arms are the ads that can be shown to users. Each time a user connects to the page makes a round. At each round, we choose one ad to display to the user. At each round n, ad i gives reward r_i(n) ∈ {0, 1}: r_i(n) = 1 if the user clicked on ad i, and 0 if the user didn’t. The goal of the algorithm is to maximize the total reward over all rounds.

Another analogy is that of a doctor choosing between experimental treatments for a series of seriously ill patients. Each action selection is a treatment selection, and each reward is the survival or well-being of the patient.
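
The ad example and the notion of a greedy action can be sketched in a few lines. Everything named here (`pull`, `greedy_action`, `run_greedy`, the click probabilities) is hypothetical and chosen only for illustration. The sketch simulates clicks as 0/1 rewards, keeps a sample-average estimate per ad, and always plays the ad with the greatest estimate.

```python
import random


def pull(arm, click_probs, rng):
    """Simulated reward: 1 with probability click_probs[arm], otherwise 0."""
    return 1 if rng.random() < click_probs[arm] else 0


def greedy_action(estimates):
    """Index of the arm with the greatest estimate (first one on ties)."""
    return max(range(len(estimates)), key=lambda i: estimates[i])


def run_greedy(click_probs, rounds, seed=0):
    """Play purely greedily; return total reward and pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(click_probs)
    counts = [0] * n_arms        # how often each ad was shown
    estimates = [0.0] * n_arms   # sample-average reward per ad
    total_reward = 0
    for _ in range(rounds):
        arm = greedy_action(estimates)
        reward = pull(arm, click_probs, rng)
        counts[arm] += 1
        # Incremental update of the sample average for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return total_reward, counts


# Hypothetical click probabilities; the last ad is actually the best one.
total, counts = run_greedy([0.05, 0.10, 0.20], rounds=1000)
print(total, counts)
```

With all estimates starting at zero and ties broken in favor of the first ad, this purely greedy player never shows the second or third ad at all: the first ad's estimate can never drop below the others' untouched zeros. It illustrates why acting only on current estimates is not enough and some exploration is needed.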

Algorithm
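
The steps of the algorithm are not spelled out in the text here, so what follows is a minimal sketch of the commonly used Beta-Bernoulli form of Thompson Sampling, applied to the ad example. It is not necessarily the exact formulation intended in this section. In each round it (1) keeps, for every ad, the number of times it earned reward 1 and reward 0, (2) draws a random value for every ad from a Beta distribution with parameters (rewards of 1) + 1 and (rewards of 0) + 1, and (3) shows the ad with the highest draw. The function name, the click probabilities and the uniform prior are assumptions made for illustration.

```python
import random


def thompson_sampling(click_probs, rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling on a simulated ad bandit.

    click_probs holds the true click rates, which are hidden from the
    algorithm and used only to simulate user clicks.
    """
    rng = random.Random(seed)
    n_arms = len(click_probs)
    ones = [0] * n_arms    # times ad i received reward 1
    zeros = [0] * n_arms   # times ad i received reward 0
    total_reward = 0
    for _ in range(rounds):
        # One draw per ad from its posterior, Beta(ones + 1, zeros + 1).
        draws = [rng.betavariate(ones[i] + 1, zeros[i] + 1)
                 for i in range(n_arms)]
        # Show the ad whose draw is highest.
        arm = max(range(n_arms), key=lambda i: draws[i])
        # Simulate whether the user clicks.
        reward = 1 if rng.random() < click_probs[arm] else 0
        if reward == 1:
            ones[arm] += 1
        else:
            zeros[arm] += 1
        total_reward += reward
    return ones, zeros, total_reward


# Hypothetical click probabilities; the last ad is the best one.
ones, zeros, total = thompson_sampling([0.05, 0.10, 0.60], rounds=2000)
pulls = [o + z for o, z in zip(ones, zeros)]
print(pulls, total)
```

Because every ad is chosen according to a random draw rather than a fixed estimate, ads with little data still get shown occasionally. As the counts grow, the draws for weak ads rarely come out on top, so most rounds go to the best-performing ad.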

Some Practical Applications

• Netflix item-based recommender systems: Images related to movies/shows are shown to users in such a way that they are more likely to watch them.
• Bidding and Stock Exchange: Predicting stocks based on current stock price data.
• Traffic Light Control: Predicting the delay in the signal.
• Automation in Industries: Bots and machines for transporting and delivering items without human intervention.
• Robotics: Reinforcement learning is used in robotics for motion planning, grasping objects, and controlling the robot’s movement. It enables robots to learn from experience and make decisions based on their environment.
• Game AI: Reinforcement learning has been used to train AI agents to play games like Chess, Go, and Poker. It has been used to develop game bots that can compete against human players.
• Natural Language Processing (NLP): Reinforcement learning is used in NLP to train chatbots and virtual assistants to provide personalized responses to users. It enables chatbots to learn from user interactions and improve their responses over time.
• Advertising: Reinforcement learning is used in advertising to optimize ad placements and target audiences. It enables advertisers to learn which ads perform best and adjust their campaigns accordingly.
• Finance: Reinforcement learning is used in finance for portfolio management, fraud detection, and risk assessment. It enables financial
