
Pandas – Create Test and Train Samples from DataFrame

Last Updated : 09 Dec, 2022

Large datasets are used to build machine learning and deep learning models. While building such a model, the dataset must be split into train and test sets, because we want to train the model on the train set and then observe its performance on the test set. These datasets are loaded into the Python environment in the form of a DataFrame. In this article, we will look at different ways to create train and test samples from a Pandas DataFrame in Python. For demonstration purposes, we will use a toy dataset (the iris dataset) from the sklearn.datasets module and load it into a DataFrame. First, we import all the necessary libraries.

Importing Libraries and Dataset

Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code.

  • Pandas – This library helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
  • Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
  • Sklearn – This module contains multiple libraries having pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.

Python3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris


Now, we will load the iris dataset in the form of a DataFrame using Pandas and then view its first five rows in the DataFrame using the head() method.

Python3
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df.head()


Output:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

If we check the shape of the DataFrame using the df.shape attribute (note that shape is an attribute, not a method), the output will be (150, 4), which means the DataFrame has 150 rows and 4 columns. Now, we want to split this DataFrame into train and test sets. Here, we will put 80% of the original DataFrame into the train set and the remaining 20% into the test set. This means 120 rows of data should be in the train set and the remaining 30 rows in the test set.
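The 80:20 arithmetic above can be checked directly; a quick sketch (the variable names are illustrative):

```python
n_rows = 150  # rows in the iris DataFrame

train_size = int(0.8 * n_rows)    # 80% of 150 -> 120 rows for training
test_size = n_rows - train_size   # the remaining 30 rows for testing

print(train_size, test_size)  # 120 30
```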

Manually splitting the DataFrame into train and test sets

The approach we will follow here is to take the first 80% of the rows as the training data and let the remaining rows serve as the testing data.

Python3
# 80% of the row count, cast to an integer so it can be used for slicing
split_idx = int(0.8 * df.shape[0])  # 120

train_set = df.iloc[:split_idx]
test_set = df.iloc[split_idx:]

train_set.shape, test_set.shape


Output:

((120, 4), (30, 4))

Here, we have manually allocated the first 80% of the rows to the train set and the remaining 20% to the test set. One should use this method only if one is sure that the data in the DataFrame is uniformly distributed and well shuffled.
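If the rows are not already in random order, one can shuffle the DataFrame first and then apply the same positional split. A minimal sketch (shuffling via DataFrame.sample with frac=1; the seed value 42 is arbitrary):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Shuffle all rows, then reset the index so positional slicing works cleanly
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

split_idx = int(0.8 * len(shuffled))   # 120
train_set = shuffled.iloc[:split_idx]  # first 80% of the shuffled rows
test_set = shuffled.iloc[split_idx:]   # remaining 20%

print(train_set.shape, test_set.shape)  # (120, 4) (30, 4)
```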

Using the DataFrame.sample() method

This method is an extension of the previous one: it removes the previous method's drawback by using the sample() method, which selects rows from the dataset at random.

Python3
train_set = df.sample(frac=0.8, random_state=42)
  
# Dropping all those indexes from the dataframe that exists in the train_set
test_set = df.drop(train_set.index)
train_set.shape, test_set.shape


Output:

((120, 4), (30, 4))

Here, we have used the sample() method of the DataFrame to draw a random sample from the original data. We passed two arguments: frac is the fraction of the DataFrame we want in the sample, and since the train set requires 80% of the data we passed frac=0.8; random_state=42 acts as a seed value, which ensures the same sample is generated across different calls. For the test set, we then dropped from the original DataFrame all rows present in the train set, leaving the remaining 20% of the data.
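A useful sanity check for this kind of split is that the two index sets are disjoint and together cover the whole DataFrame. A small sketch repeating the split from above:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

train_set = df.sample(frac=0.8, random_state=42)
test_set = df.drop(train_set.index)

# No row appears in both sets, and together they cover all 150 rows
overlap = train_set.index.intersection(test_set.index)
print(len(overlap), len(train_set) + len(test_set))  # 0 150
```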

Using the train_test_split() method present in Sklearn

In practice, one of the most common ways to split a DataFrame is the train_test_split() method. It can also split two aligned objects simultaneously, such as a feature matrix and a target vector.

Python3
train_set, test_set = train_test_split(df, random_state=42, test_size=0.2)
print(train_set.shape, test_set.shape)


Output:

(120, 4) (30, 4)

Here, we use the train_test_split() method from the sklearn.model_selection module to split our DataFrame into train and test sets. We pass three arguments: the original DataFrame itself; random_state, which works as explained in the previous method; and test_size, the fraction of the DataFrame we want in the test set. Since we need 20% of the data as a test set, we pass test_size=0.2. The function then returns 80% of the rows as the train set and the remaining 20% as the test set.
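As mentioned above, train_test_split() can also split a feature matrix and a target vector in one call while keeping the rows aligned. A sketch (the stratify argument is optional; it preserves the class proportions of y in both splits, and the name 'species' is illustrative):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(data=iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')

# One call splits both objects with the same random row selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (120, 4) (30, 4) (120,) (30,)
```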

Using the numpy.random.rand() method

In this approach, we create a boolean mask the size of the DataFrame, then select the rows where the mask is True as the training data and the rows where it is False as the testing data. But how do we ensure the split is close to an 80:20 ratio? The answer lies in how the numpy.random.rand function generates numbers: it draws values uniformly from the interval [0, 1), so about 80% of them fall below 0.8.

Python3
mask = np.random.rand(len(df)) < 0.8
train_set = df[mask]
test_set = df[~mask]
train_set.shape, test_set.shape


Output:

((119, 4), (31, 4))
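Note that this split is only approximately 80:20 (here 119:31) and changes on every run, since the mask is drawn randomly. Seeding the generator makes it reproducible; a sketch using NumPy's newer Generator API (the seed 0 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator for reproducibility
mask = rng.random(150) < 0.8     # True for roughly 80% of positions

n_train = mask.sum()             # number of rows that would go to the train set
n_test = (~mask).sum()           # number of rows that would go to the test set
print(n_train, n_test)           # sums to 150; n_train is close to, but not
                                 # necessarily exactly, 120
```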

