Splitting Data for Machine Learning Models

Data is at the heart of every ML problem. Without proper data, ML models are just like bodies without a soul. But in today's world of 'big data', collecting data is no longer a major problem: we knowingly (or unknowingly) generate huge datasets every day. However, having surplus data at hand still does not solve the problem. For ML models to give reasonable results, we not only need to feed in large quantities of data but also have to ensure the quality of that data. 
Though making sense of raw data is an art in itself, requiring good feature engineering skills and (in special cases) domain knowledge, even quality data is of no use until it is properly used. The major problem ML/DL practitioners face is how to divide the data for training and testing. Though it seems like a simple problem at first, its complexity can be gauged only by diving deep into it. Poorly chosen training and testing sets can have unpredictable effects on the output of the model: they may lead to overfitting or underfitting, and our model may end up giving biased results. 
How to divide the data then? 
The data should ideally be divided into 3 sets – namely, the train set, the test set, and the holdout cross-validation or development (dev) set. Let's first understand in brief what these sets mean and what type of data they should contain. 
 

  • Train Set: 
    The train set contains the data that will be fed into the model; in simple terms, our model learns from this data. For instance, a regression model would use these examples to compute the gradients of the cost function and update its parameters step by step, reducing the cost so that the model predicts effectively.
  • Dev Set: 
    The development set is used to validate the trained model. This is the most important split, as it forms the basis of our model evaluation. If the difference between the error on the training set and the error on the dev set is huge, the model has high variance and is hence a case of over-fitting.
  • Test Set: 
    The test set contains the data on which we test the trained and validated model. It tells us how efficient our overall model is and how likely it is to predict something that does not make sense. There is a plethora of evaluation metrics (like precision, recall, and accuracy) that can be used to measure the performance of our model; a small sketch of how they are computed follows below.
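As a minimal illustration of such metrics (scikit-learn is used here purely for this sketch and is not part of the rest of the article; the label arrays are made up):

Python3

# A hypothetical sketch: computing accuracy, precision, and recall
# for made-up ground-truth and predicted labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels from a test set
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # labels predicted by a model

print("Accuracy :", accuracy_score(y_true, y_pred))   # fraction predicted correctly
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)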

Some tips for choosing Train/Dev/Test sets 
 

  • The size of the train, dev, and test sets remains one of the vital topics of discussion. Though for general Machine Learning problems a train/dev/test split of around 60/20/20 (or 80/10/10) is acceptable, in today's world of Big Data even 1% of a dataset with millions of records amounts to a huge dataset. We can move most of the data into training and help our model learn better and more diverse features. So, in the case of large datasets, a train/dev/test split of 98/1/1 would suffice.

          Old Distribution: 

Train (80%) | Dev (10%) | Test (10%)

      Now we can split our dataset with a Machine Learning library called Turicreate, which will help us divide the data into train, test, and dev sets.

Python3


# Importing the turicreate library
import turicreate as tc
 
# Loading the data into an SFrame
data = tc.SFrame("data.csv")
 
# random_split randomly divides the rows between two SFrames.
# Here 0.8 means 80% of the rows become the training data and
# the remaining 20% go to test_data; the dev set will be carved
# out of test_data in the next step. Fixing the seed makes the
# split reproducible, so we get the same sets again and again.
train_data_set, test_data = data.random_split(0.8, seed=0)
 
# Now split test_data into two sets of equal size:
# 10% of the full data for the dev set and 10% for the test set
test_data_set, dev_set = test_data.random_split(0.5, seed=0)
 
# An example model showing how to use these sets. 'XYZ' is a
# placeholder for the name of the column we want to predict,
# and the dev set is passed as the validation set.
model = tc.linear_regression.create(train_data_set,
                                    target="XYZ",
                                    validation_set=dev_set)
 
# Finally, test the trained and validated model on the test set
predictions = model.predict(test_data_set)
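The same 80/10/10 split can be reproduced without Turicreate; here is a minimal sketch using scikit-learn's train_test_split (assuming the data fits in a pandas DataFrame, with "data.csv" again a placeholder path):

Python3

# A sketch of the same 80/10/10 split with scikit-learn;
# "data.csv" is a placeholder file name.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("data.csv")

# First set aside 20% of the rows, then split that portion
# half-and-half into dev and test sets; random_state plays
# the same role as seed in the Turicreate example above.
train_set, rest = train_test_split(data, test_size=0.2, random_state=0)
dev_set, test_set = train_test_split(rest, test_size=0.5, random_state=0)

print(len(train_set), len(dev_set), len(test_set))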



        Distribution in the Big Data era: 

Train (98%) | Dev (1%) | Test (1%)
  • The dev and test sets should be from the same distribution. We should prefer taking the whole dataset, shuffling it, and then randomly splitting it into the dev and test sets.
  • The train set may come from a slightly different distribution than the dev/test sets.
  • We should choose dev and test sets that reflect the data we expect to get in the future and consider important to do well on. The dev and test sets should be such that the model becomes more robust.

Python3


# Importing the turicreate library
import turicreate as tc
 
# Loading the data into an SFrame
data = tc.SFrame("data.csv")
 
# Here 0.98 means 98% of the rows become the training data and
# the remaining 2% go to test_data; as before, fixing the seed
# keeps the split reproducible.
train_data_set, test_data = data.random_split(0.98, seed=0)
 
# Split test_data into two sets of equal size:
# 1% of the full data for the dev set and 1% for the test set
test_data_set, dev_set = test_data.random_split(0.5, seed=0)
 
# The same example model as before: 'XYZ' is a placeholder for
# the column to predict, and the dev set is the validation set.
model = tc.linear_regression.create(train_data_set,
                                    target="XYZ",
                                    validation_set=dev_set)
 
# Test the trained and validated model on the test set
predictions = model.predict(test_data_set)



Handling mismatched Train and Dev/Test sets: 
There may be cases where the train set and the dev/test sets come from slightly different distributions. For example, suppose we are building a mobile app to classify flowers into different categories: the user clicks an image of a flower, and our app outputs the name of the flower. 
Now suppose that in our dataset we have 200,000 images taken from web pages and only 10,000 images generated from mobile cameras. In this scenario, we have 2 possible options: 
Option 1: We can randomly shuffle the data and divide it into train/dev/test sets as follows:
 

Set:    Train (205,000) | Dev (2,500) | Test (2,500)
Source: Random          | Random      | Random

In this case, the train, dev, and test sets are all from the same distribution, but the problem is that the dev and test sets will have a major chunk of data from web images, which we do not care about. 
Option 2: We can take all the images from web pages into the train set, add 5,000 camera-generated images to it, and divide the remaining 5,000 camera images between the dev and test sets.
 

Set:    Train (205,000)                            | Dev (2,500) | Test (2,500)
Source: 200,000 from web pages + 5,000 from camera | Camera      | Camera

In this case, we target the distribution we really care about (camera images), hence it will lead to better performance in the long run; a sketch of both options follows below. 
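As a minimal, hypothetical sketch of the two options (web_images and camera_images are made-up lists standing in for the actual image paths):

Python3

# A hypothetical sketch of Options 1 and 2; web_images and
# camera_images stand in for 200,000 web and 10,000 camera images.
import random

random.seed(0)
web_images = ["web_%d.jpg" % i for i in range(200000)]
camera_images = ["cam_%d.jpg" % i for i in range(10000)]

# Option 1: shuffle everything together, then cut
# 205,000 / 2,500 / 2,500 for train / dev / test.
pool = web_images + camera_images
random.shuffle(pool)
train1, dev1, test1 = pool[:205000], pool[205000:207500], pool[207500:]

# Option 2: train on all web images plus 5,000 camera images;
# dev and test contain only camera images, the distribution we care about.
random.shuffle(camera_images)
train2 = web_images + camera_images[:5000]
dev2 = camera_images[5000:7500]
test2 = camera_images[7500:]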
When to change Dev/Test set? 
Suppose we have 2 models, A and B, with 3% and 5% error rates on the dev set respectively. Though it seems A has better performance, let's say A also lets through some censored data, which is not acceptable to you. In the case of B, though it has a higher error rate, the probability of letting censored data through is negligible. So the metric and dev set favor model A, but you and other users favor model B. This is a sign that there is a problem either in the metric used for evaluation or in the dev/test set. 
To solve a metric problem, we can add a large penalty to the cost function whenever censored data is let through. The problem may also lie in the data itself: for instance, the images in the dev/test set were high-resolution while those seen in real time are blurry; here, we need to change the dev/test set distribution. That was all about splitting datasets for ML problems; as a final illustration, one way to encode such a penalty into the evaluation metric is sketched below. 
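A minimal sketch of such a weighted error metric (the weight of 100 is an arbitrary, illustrative assumption, not a standard value):

Python3

# A hypothetical weighted error metric that heavily penalizes
# letting censored examples through; penalty=100 is arbitrary.
def weighted_error(y_true, y_pred, is_censored, penalty=100):
    weights = [penalty if c else 1 for c in is_censored]
    errors = [w * (t != p) for w, t, p in zip(weights, y_true, y_pred)]
    return sum(errors) / sum(weights)

# Under this metric, one censored image slipping through costs as
# much as 100 ordinary mistakes, so model B from the example above
# can now score better than model A.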
 



