
PySpark Random Sample with Example

Last Updated : 28 Dec, 2022

Do you work in a field where you need to handle a lot of data on a daily basis? Then you have surely felt the need to extract a random sample from a data set. There are numerous ways to do this. Don't know them all? Continue reading this article to learn how to extract a random sample from a PySpark data set using Python.

Prerequisite

Note: When following the PySpark installation article, install Python instead of Scala; the rest of the steps are the same.

Modules Required:

PySpark: PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for large-scale data processing. This module can be installed through the following command:

pip install pyspark
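To verify the installation, a quick sanity check is to print the installed version (this assumes a working Java runtime is available, which Spark requires):

import pyspark
print(pyspark.__version__)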

Student_data.csv file:

[Screenshot of the student_data.csv file]

Methods to get a PySpark random sample:

  • PySpark SQL Sample
    1. Using sample function
    2. Using sampleBy function
  • PySpark RDD Sample
    1. Using sample function
    2. Using takeSample function

PySpark SQL Sample

1. Using sample function:

Here we use the sample function to get a random sample of a PySpark DataFrame.

Syntax: sample(withReplacement, fraction, seed=None)

Here,

  • withReplacement – Boolean value that controls whether a row can be sampled more than once. True means rows may be repeated in the sample, while False means each row appears at most once. By default, the value is False.
  • fraction – Fractional number in the range 0 to 1 which represents the fraction of rows to sample. Note that this is an expected proportion, not a guaranteed exact row count.
  • seed – The seed for sampling, which makes the sample reproducible: with the same seed (and the same data), the same rows are returned every time until the seed value is changed. A short sketch illustrating these parameters follows this list.
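As a quick illustration of these parameters, here is a minimal sketch using a small DataFrame built in-line with spark.range (not the student dataset used later in this article). It shows that fraction is an expected proportion rather than an exact count, and that fixing the seed makes the result reproducible:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A 100-row DataFrame with a single 'id' column
df = spark.range(100)

# fraction=0.1 keeps roughly 10 rows; the exact count varies per run
print(df.sample(fraction=0.1).count())

# With a fixed seed, the same rows are returned on every call
print(df.sample(fraction=0.1, seed=42).collect())
print(df.sample(fraction=0.1, seed=42).collect())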

Stepwise Implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Finally, extract the random sample of the data frame using the sample function with withReplacement, fraction, and seed as arguments.

data_frame.sample(withReplacement, fraction, seed=None)

Example 1:

In this example, we extract a sample from the data frame (a 5×5 dataset) through the sample function, passing only a fraction as an argument. We extract the random sample twice to see whether we get the same rows each time; we observe that we get different rows on each run.

Python3

# Python program to extract a PySpark random sample
# through the sample function with fraction as the only argument

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Extract a random sample through the sample
# function using only fraction as an argument
print(data_frame.sample(0.4).collect())

# Extract a random sample again with the same fraction
# to check whether we get the same output each time
print(data_frame.sample(0.4).collect())


Output:

When we run the sample command for the first time, we get the following output:

[Output screenshot]

When we run the sample command for the second time, we get the following output:

[Output screenshot]

Example 2:

In this example, we extract a sample from the data frame (a 5×5 dataset) through the sample function, passing both a fraction and a seed as arguments. We extract the sample twice to see whether we get the same rows each time; because the seed is fixed, we get the same rows on both runs.

Python3

# Python program to extract a PySpark random sample through
# the sample function with fraction and seed as arguments

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Extract a random sample through the sample function
# using fraction and seed as arguments
print(data_frame.sample(0.4, 26).collect())

# Extract a random sample again with the same fraction and seed
# to check whether we get the same output each time
print(data_frame.sample(0.4, 26).collect())


Output:

When we run the sample command for the first time, we get the following output:

[Output screenshot]

When we run the sample command for the second time, we get the following output:

[Output screenshot]

Example 3:

In this example, we extract a sample from the data frame (a 5×5 dataset) through the sample function, passing a fraction and withReplacement as arguments. We extract the sample twice: once with withReplacement set to True and once with it set to False. We observe that with False no row is repeated, while with True some rows appear more than once.

Python3

# Python program to extract a PySpark random sample through the
# sample function with withReplacement and fraction as arguments

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Extract a random sample through the sample function using
# withReplacement (value=True) and fraction as arguments
print(data_frame.sample(True, 0.8).collect())

# Extract a random sample again using
# withReplacement (value=False) and fraction as arguments
print(data_frame.sample(False, 0.8).collect())


Output:

When we run the sample command for the first time, we get the following output:

[Output screenshot]

When we run the sample command for the second time, we get the following output:

[Output screenshot]

2. Using sampleBy function

Syntax: sampleBy(column, fractions, seed=None)

Here,

  • column – The column of the DataFrame that defines the strata; it can be passed as a column name or Column object.
  • fractions – A dictionary that maps each value (stratum) of the column to the fraction of its rows to sample. Values of the column that are missing from the dictionary are treated as having fraction 0.
  • seed – The seed for sampling, which makes the sample reproducible: with the same seed, the same rows are returned every time until the seed value is changed. A short stratified-sampling sketch follows this list.
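As a quick illustration, here is a minimal stratified-sampling sketch on a toy DataFrame built in-line; the 'key' column and its fractions are illustrative, not part of the student dataset:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# 100 rows whose 'key' column holds 0, 1 or 2
df = spark.range(100).withColumn("key", col("id") % 3)

# Keep roughly 10% of the key=0 rows and 50% of the key=1 rows;
# key=2 is absent from the dictionary, so none of its rows are kept
sampled = df.sampleBy("key", fractions={0: 0.1, 1: 0.5}, seed=0)
sampled.groupBy("key").count().show()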

Stepwise Implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Finally, extract the random sample of the data frame using the sampleBy function with column, fractions, and seed as arguments.

data_frame.sampleBy(column, fractions, seed=None)

Example:

In this example, we extract a stratified sample from the data frame (a 5×5 dataset) through the sampleBy function, passing column, fractions, and seed as arguments.

Python3

# Python program to extract a PySpark random sample through the
# sampleBy function with column, fractions and seed as arguments

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Extract a stratified sample through the sampleBy function
# using column, fractions and seed as arguments
print(data_frame.sampleBy(data_frame.fees, {18000: 0.4, 15000: 0.6}, 0).collect())


Output:

[Output screenshot]

PySpark RDD Sample

1. Using sample function:

Here we use the sample function to get a random sample of a PySpark RDD.

Syntax: sample(withReplacement, fraction, seed=None)

Stepwise Implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Next, convert the data frame to an RDD of Rows to perform the sampling operations on it.

data_frame_rdd = data_frame.rdd

Step 5: Finally, extract the random sample of the RDD using the sample function with withReplacement, fraction, and seed as arguments.

data_frame_rdd.sample(withReplacement, fraction, seed=None)
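Note that data_frame.rdd yields an RDD of Row objects, and that, unlike DataFrame.sample, RDD.sample expects withReplacement as its first positional argument. Here is a minimal sketch on a toy DataFrame built in-line (the column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy two-column DataFrame
toy_df = spark.createDataFrame([(i, 'name_' + str(i)) for i in range(10)],
                               ['id', 'name'])

# Convert to an RDD of Row objects
toy_rdd = toy_df.rdd

# sample is a transformation, so collect() is needed to see the rows
print(toy_rdd.sample(False, 0.3, 1).collect())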

Example:

In this example, we extract a sample from the RDD (built from the 5×5 dataset) through the sample function, passing a fraction and withReplacement as arguments. We extract the sample twice: once with withReplacement set to True and once with it set to False. We observe that with False no row is repeated, while with True some rows appear more than once.

Python3

# Python program to extract a PySpark random sample through the RDD
# sample function with withReplacement and fraction as arguments

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Convert the data frame to an RDD of Rows
data_frame_rdd = data_frame.rdd

# Extract a random sample through the sample function using
# withReplacement (value=True) and fraction as arguments
print(data_frame_rdd.sample(True, 0.2).collect())

# Extract a random sample again using
# withReplacement (value=False) and fraction as arguments
print(data_frame_rdd.sample(False, 0.2).collect())


Output:

When we run the sample command for the first time, we get the following output:

[Output screenshot]

When we run the sample command for the second time, we get the following output:

[Output screenshot]

2. Using takeSample function

Here we use the takeSample function to get a random sample of a PySpark RDD.

Syntax: takeSample(withReplacement, num, seed=None)
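Unlike sample, takeSample is an action: it returns a plain Python list of exactly num elements to the driver, so it should only be used when the requested sample fits in driver memory. A minimal sketch on a toy RDD built with parallelize:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100))

# Exactly 5 elements, without replacement, reproducible via the seed
print(rdd.takeSample(False, 5, 7))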

Stepwise Implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Next, convert the data frame to an RDD of Rows to perform the sampling operations on it.

data_frame_rdd = data_frame.rdd

Step 5: Finally, extract the random sample of the RDD using the takeSample function with withReplacement, num, and seed as arguments.

data_frame_rdd.takeSample(withReplacement, num, seed=None)

Example:

In this example, we extract a sample from the RDD (built from the 5×5 dataset) through the takeSample function, passing withReplacement, num, and seed as arguments. We call it twice: once with withReplacement set to True and once with it set to False. We observe that with False no row is repeated, while with True some rows may appear more than once.

Python3

# Python program to extract a PySpark random sample through the
# takeSample function with withReplacement, num and seed as arguments

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Convert the data frame to an RDD of Rows
data_frame_rdd = data_frame.rdd

# Extract a random sample through the takeSample function using
# withReplacement (value=True), num and seed as arguments
print(data_frame_rdd.takeSample(True, 2, 2))

# Extract a random sample again using
# withReplacement (value=False), num and seed as arguments
print(data_frame_rdd.takeSample(False, 2, 2))


Output:

When we run the sample command for the first time, we get the following output:

[Output screenshot]

When we run the sample command for the second time, we get the following output:

[Output screenshot]


