PySpark – Random Splitting Dataframe

In this article, we are going to learn how to randomly split a data frame using PySpark in Python.

PySpark is a tool created by the Apache Spark community for using Python together with Spark. While working with a PySpark data frame, we are sometimes required to split it randomly. In this article, we achieve this using the randomSplit() function of PySpark. This function not only splits the data frame according to the supplied fractions, but also returns different rows each time it is run, unless a seed is fixed.

randomSplit() function:

Syntax: data_frame.randomSplit(weights, seed=None)

Parameters:

  • weights: A list of doubles giving the relative sizes of the resulting splits; weights that do not sum to 1 are normalized.
  • seed: The seed for sampling; with the same seed and weights, the data frame is always split into the same fractional parts. A quick sketch follows this list.
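
For instance, a minimal sketch of an approximate 70/30 split (assuming an existing data frame df and an arbitrary seed value of 42):

# Weights are relative, so [7.0, 3.0] would produce the
# same proportions as [0.7, 0.3]
train_df, test_df = df.randomSplit([0.7, 0.3], seed=42)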

Prerequisite

Note: Follow the steps given in the article on installing PySpark; install Python instead of Scala, and the rest of the steps remain the same.

Modules Required:

Pyspark: PySpark is the API introduced to use Spark with the Python language; it offers functionality similar to Python's Scikit-learn and Pandas libraries. The module can be installed with the following command:

pip install pyspark

Stepwise Implementation:

Step 1: First of all, import the required library, i.e., SparkSession. The SparkSession library is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a Spark session using the getOrCreate() function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to check that it was read correctly.

data_frame = spark_session.read.csv('Path_to_csv_file',
                                    sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Next, split the data frame randomly using the randomSplit function, with weights and seed as arguments. Then, store the resulting data frames either in a list or in separate variables (a sketch of the second option follows the snippet below).

splits = data_frame.randomSplit(weights, seed=None)
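
If the weights list has exactly two elements, the two resulting data frames can also be unpacked directly into separate variables (a sketch, assuming a two-element weights list):

split1, split2 = data_frame.randomSplit(weights, seed=None)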

Step 5: Finally, display the counts of the list elements or the variables to see how the data frame was split.

print(splits[0].count())
print(splits[1].count())
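
To see the resulting fractions rather than raw counts, the counts can be compared against the total; a minimal sketch (assuming the splits list from Step 4):

# Report each split's size as a share of the full data frame
total = data_frame.count()
for i, part in enumerate(splits):
    print(f"Split {i}: {part.count()} rows "
          f"({part.count() / total:.2%} of the total)")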

Example 1: 

In this example, we split the data frame (the california_housing_train.csv dataset) through the randomSplit function with only weights as an argument, and stored the result in a list. We split the data frame twice to see whether we get the same fractional parts each time; we observed different counts on each run.

Python3

# Python program to show random sampling of
# Pyspark data frame without seed as argument
# and storing the result in list
 
# Import the SparkSession library
from pyspark.sql import SparkSession
 
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
 
# Read the CSV file
# Here CSV is in the same folder
data_frame = spark_session.read.csv(
  'california_housing_train.csv',
  sep=',', inferSchema=True, header=True)
 
# Display the csv file read
data_frame.show()
 
# Split the dataframe into 2 parts with only weights
# as an argument, so that the dataframe is split into
# different fractional parts on every run
splits = data_frame.randomSplit([1.0, 3.0])

# Checking the count of the 1st part of the split dataframe
print(splits[0].count())

# Checking the count of the 2nd part of the split dataframe
print(splits[1].count())

# Split the dataframe again with only weights as an
# argument to check whether the split changes or
# remains the same
splits = data_frame.randomSplit([1.0, 3.0])

# Checking the count of the 1st part of the split dataframe
print(splits[0].count())

# Checking the count of the 2nd part of the split dataframe
print(splits[1].count())


Output:

4233
12767
4202
12798
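
Note that the counts only approximate the 1:3 ratio given by the weights, since randomSplit assigns each row to a split independently at random. A quick check of the expected sizes for the 17,000-row example file:

# Expected split sizes for weights [1.0, 3.0]
total = 4233 + 12767            # 17000 rows in both runs
expected = [total * w / (1.0 + 3.0) for w in (1.0, 3.0)]
print(expected)                 # [4250.0, 12750.0], close to the observed counts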

Example 2:

In this example, we split the data frame through the randomSplit function with both weights and seed as arguments, and stored the result in a list. We split the data frame twice to see whether we get the same fractional parts each time; we observed the same counts on both runs.

Python3

# Python program to show random sampling of
# Pyspark data frame with seed as argument
# and storing the result in list
 
# Import the SparkSession library
from pyspark.sql import SparkSession
 
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
 
# Read the CSV file
# Here CSV file saved in same folder
data_frame = spark_session.read.csv(
  'california_housing_train.csv',
  sep=',', inferSchema=True, header=True)
 
# Display the csv file read
data_frame.show()
 
# Split the dataframe into 2 parts with weights and
# seed as arguments, so that the dataframe is always
# split into the same fractional parts
splits = data_frame.randomSplit([1.0, 3.0], 26)

# Checking the count of the 1st part of the split dataframe
print(splits[0].count())

# Checking the count of the 2nd part of the split dataframe
print(splits[1].count())

# Split the dataframe again with the same weights and
# seed to check whether the split changes or
# remains the same
splits = data_frame.randomSplit([1.0, 3.0], 26)

# Checking the count of the 1st part of the split dataframe
print(splits[0].count())

# Checking the count of the 2nd part of the split dataframe
print(splits[1].count())


Output:

4181
12819
4181
12819
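
Changing the seed (or the weights) produces a different, though again reproducible, split; a sketch with a hypothetical different seed value:

# A different seed yields a different, but repeatable, split
splits = data_frame.randomSplit([1.0, 3.0], 99)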

Example 3:

In this example, we split the data frame through the randomSplit function with only weights as an argument, and stored the results in separate variables. We split the data frame twice to see whether we get the same fractional parts each time; we observed different counts on each run.

Python3

# Python program to show random sampling of
# Pyspark data frame without seed as argument
# and storing the result in variables
 
# Import the SparkSession library
from pyspark.sql import SparkSession
 
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
 
# Read the CSV file
# Here CSV file is saved in same folder
data_frame = spark_session.read.csv(
  'california_housing_train.csv',
  sep=',', inferSchema=True, header=True)
 
# Display the csv file read
data_frame.show()
 
# Split the dataframe into 2 parts, split1 & split2,
# with only weights as an argument, so that the dataframe
# is split into different fractional parts on every run
split1, split2 = data_frame.randomSplit([1.0, 5.0])

# Checking the count of split1
print(split1.count())

# Checking the count of split2
print(split2.count())

# Split the dataframe again with only weights as an
# argument to check whether the split changes or
# remains the same
split1, split2 = data_frame.randomSplit([1.0, 5.0])

# Checking the count of split1
print(split1.count())

# Checking the count of split2
print(split2.count())


Output:

2818
14182
2783
14217

Example 4:

In this example, we split the data frame through the randomSplit function with both weights and seed as arguments, and stored the results in separate variables. We split the data frame twice to see whether we get the same fractional parts each time; we observed the same counts on both runs.

Python3

# Python program to show random sampling of
# Pyspark data frame with seed as argument
# and storing the result in variables
 
# Import the SparkSession library
from pyspark.sql import SparkSession
 
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
 
# Read the CSV file
# Here csv file is saved in same folder
data_frame = spark_session.read.csv(
  'california_housing_train.csv',
  sep=',', inferSchema=True, header=True)
 
# Display the csv file read
data_frame.show()
 
# Split the dataframe into 2 parts, split1 & split2,
# with weights and seed as arguments, so that the
# dataframe is always split into the same fractional parts
split1, split2 = data_frame.randomSplit([1.0, 5.0], 24)

# Checking the count of split1
print(split1.count())

# Checking the count of split2
print(split2.count())

# Split the dataframe again with the same weights and
# seed to check whether the split changes or
# remains the same
split1, split2 = data_frame.randomSplit([1.0, 5.0], 24)

# Checking the count of split1
print(split1.count())

# Checking the count of split2
print(split2.count())


Output:

2776
14224
2776
14224
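
randomSplit also generalizes to more than two parts; a common pattern (a sketch, not part of the examples above) is a reproducible train/validation/test split:

# Approximate 80/10/10 split, repeatable thanks to the seed
train, validation, test = data_frame.randomSplit([0.8, 0.1, 0.1], seed=24)
print(train.count(), validation.count(), test.count())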


Last Updated : 01 Feb, 2023