What is Active Learning?
Active Learning is a special case of Supervised Machine Learning. This approach is used to construct a high performance classifier while keeping the size of the training dataset to a minimum by actively selecting the valuable data points.
Where should we apply active learning?
- We have a very small amount or a huge amount of dataset.
- Annotation of the unlabeled dataset cost human effort, time and money.
- We have access to limited processing power.
On a certain planet, there are various fruits of different size(1-5), some of them are poisonous and others don’t. The only criteria to decide a fruit is poisonous or not is it’s size. our task is to train a classifier which predicts the given fruit is poisonous or not. The only information we have, is fruit with size 1 is not poisonous, the fruit of size 5 is poisonous and after a particular size, all fruits are poisonous.
The first approach is to check each and every size of the fruit, which consume time and resources.
The second approach is to apply the binary search and find the transition point (decision boundary). This approach uses fewer data and gives the same results as of linear search.
General Algorithm : 1. train classifier with the initial training dataset 2. calculate the accuracy 3. while(accuracy < desired accuracy): 4. select the most valuable data points (in general points close to decision boundary) 5. query that data point/s (ask for a label) from human oracle 6. add that data point/s to our initial training dataset 7. re-train the model 8. re-calculate the accuracy
Approaches Active Learning Algorithm
1. Query Synthesis
- Generally this approach is used when we have a very small dataset.
- This approach we choose any uncertain point from given n-dimensional space. we don’t care about the existence of that point.
- Sometime it would be difficult for human oracle to annotate the queried data point.
- This approach is used when we have a large dataset.
- In this approach we split our dataset into three parts: Training Set; Test Set; Unlabeled Pool(ironical) [5%; 25%, 70%].
- This trainig dataset is our initial dataset and is used to initially train our model.
- This approach selects valuable/uncertain points from this unlabeled pool, this ensures that all the querry can be recognized by human oracle
Here is an active learning model which decide valuable points on the basis of, the probability of a point present in a class. In Logistic Regression points closest to the threshold (i.e. probability = 0.5) is the most uncertain point. So, I choose the probability between 0.47 to 0.53 as a range of uncertainty.
You can download the dataset from here.
Output: Accuracy by active model : 80.7 Accuracy by random sampling : 79.5
There are several models for the selection of most valuable points. Some of them are:
- Querry by committee
- Query synthesis and Nearest neighbour search
- Large margin-based heuristics
- Posterior probability-based heuristics
- Learning Model Building in Scikit-learn : A Python Machine Learning Library
- Learning to learn Artificial Intelligence | An overview of Meta-Learning
- Introduction to Multi-Task Learning(MTL) for Deep Learning
- ML | Reinforcement Learning Algorithm : Python Implementation using Q-learning
- Artificial intelligence vs Machine Learning vs Deep Learning
- ML | Types of Learning – Supervised Learning
- How to Start Learning Machine Learning?
- Q-Learning in Python
- Deep Q-Learning
- Machine Learning in C++
- Reinforcement learning
- ML | What is Machine Learning ?
- Regularization in Machine Learning
- Stacking in Machine Learning
- ML | Semi-Supervised Learning
If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to email@example.com. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.