
PySpark Random Sample with Example

Last Updated : 28 Dec, 2022

Do you work in a field where you need to handle a lot of data on a daily basis? Then you have surely felt the need to extract a random sample from a data set. There are numerous ways to do this. Don't know them all? Continue reading this article to learn how to extract a random sample from a PySpark data set using Python.

Prerequisite

Note: When following the PySpark installation article, install Python instead of Scala; the rest of the steps are the same.

Modules Required:

PySpark: PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for large-scale data processing. This module can be installed through the following command:

pip install pyspark
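To verify the installation, a quick sanity check is to print the installed version (this assumes a working Java runtime is available, which Spark requires):

import pyspark
print(pyspark.__version__)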

Student_data.csv file:

[Screenshot of the student_data.csv file]

Methods to get a PySpark random sample:

  • PySpark SQL Sample
    1. Using sample function
    2. Using sampleBy function
  • PySpark RDD Sample
    1. Using sample function
    2. Using takeSample function

PySpark SQL Sample

1. Using sample function:

Here we use the sample function to get a random sample of a PySpark DataFrame.

Syntax: sample(withReplacement, fraction, seed=None)

Here,

  • withReplacement – Boolean value that controls whether a row can be sampled more than once. True means rows may be repeated in the sample, while False means each row appears at most once. By default, the value is False.
  • fraction – Fractional number in the range 0 to 1 which represents the fraction of rows to sample. Note that this is an expected proportion, not a guaranteed exact row count.
  • seed – The seed for sampling, which makes the sample reproducible: with the same seed (and the same data), the same rows are returned every time until the seed value is changed. A short sketch illustrating these parameters follows this list.
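As a quick illustration of these parameters, here is a minimal sketch using a small DataFrame built in-line with spark.range (not the student dataset used later in this article). It shows that fraction is an expected proportion rather than an exact count, and that fixing the seed makes the result reproducible:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A 100-row DataFrame with a single 'id' column
df = spark.range(100)

# fraction=0.1 keeps roughly 10 rows; the exact count varies per run
print(df.sample(fraction=0.1).count())

# With a fixed seed, the same rows are returned on every call
print(df.sample(fraction=0.1, seed=42).collect())
print(df.sample(fraction=0.1, seed=42).collect())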

Stepwise Implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Finally, extract the random sample of the data frame using the sample function with withReplacement, fraction, and seed as arguments.

data_frame.sample(withReplacement, fraction, seed=None)

Example 1:

In this example, we extract a sample from the data frame (a 5×5 dataset) through the sample function, passing only a fraction as an argument. We extract the random sample twice to see whether we get the same rows each time; we observe that we get different rows on each run.

Python3

# Python program to extract a PySpark random sample
# through the sample function with fraction as the only argument

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Extract a random sample through the sample
# function using only fraction as an argument
print(data_frame.sample(0.4).collect())

# Extract a random sample again with the same fraction
# to check whether we get the same output each time
print(data_frame.sample(0.4).collect())


Output:

When we run the sample command for the first time, we get the following output:

[Output screenshot]

When we run the sample command for the second time, we get the following output:

[Output screenshot]

Example 2:

In this example, we extract a sample from the data frame (a 5×5 dataset) through the sample function, passing both a fraction and a seed as arguments. We extract the sample twice to see whether we get the same rows each time; because the seed is fixed, we get the same rows on both runs.

Python3

# Python program to extract a PySpark random sample through
# the sample function with fraction and seed as arguments

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Extract a random sample through the sample function
# using fraction and seed as arguments
print(data_frame.sample(0.4, 26).collect())

# Extract a random sample again with the same fraction and seed
# to check whether we get the same output each time
print(data_frame.sample(0.4, 26).collect())


Output:

When we run the sample command for the first time, we get the following output:

[Output screenshot]

When we run the sample command for the second time, we get the following output:

[Output screenshot]

Example 3:

In this example, we extract a sample from the data frame (a 5×5 dataset) through the sample function, passing a fraction and withReplacement as arguments. We extract the sample twice: once with withReplacement set to True and once with it set to False. We observe that with False no row is repeated, while with True some rows appear more than once.

Python3

# Python program to extract a PySpark random sample through the
# sample function with withReplacement and fraction as arguments

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Extract a random sample through the sample function using
# withReplacement (value=True) and fraction as arguments
print(data_frame.sample(True, 0.8).collect())

# Extract a random sample again using
# withReplacement (value=False) and fraction as arguments
print(data_frame.sample(False, 0.8).collect())


Output:

When we run the sample command for the first time, we get the following output:

[Output screenshot]

When we run the sample command for the second time, we get the following output:

[Output screenshot]

2. Using sampleBy function

Syntax: sampleBy(column, fractions, seed=None)

Here,

  • column – The column of the DataFrame that defines the strata; it can be passed as a column name or Column object.
  • fractions – A dictionary that maps each value (stratum) of the column to the fraction of its rows to sample. Values of the column that are missing from the dictionary are treated as having fraction 0.
  • seed – The seed for sampling, which makes the sample reproducible: with the same seed, the same rows are returned every time until the seed value is changed. A short stratified-sampling sketch follows this list.
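As a quick illustration, here is a minimal stratified-sampling sketch on a toy DataFrame built in-line; the 'key' column and its fractions are illustrative, not part of the student dataset:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# 100 rows whose 'key' column holds 0, 1 or 2
df = spark.range(100).withColumn("key", col("id") % 3)

# Keep roughly 10% of the key=0 rows and 50% of the key=1 rows;
# key=2 is absent from the dictionary, so none of its rows are kept
sampled = df.sampleBy("key", fractions={0: 0.1, 1: 0.5}, seed=0)
sampled.groupBy("key").count().show()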

Stepwise Implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Finally, extract the random sample of the data frame using the sampleBy function with column, fractions, and seed as arguments.

data_frame.sampleBy(column, fractions, seed=None)

Example:

In this example, we extract a stratified sample from the data frame (a 5×5 dataset) through the sampleBy function, passing column, fractions, and seed as arguments.

Python3

# Python program to extract a PySpark random sample through the
# sampleBy function with column, fractions and seed as arguments

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Extract a stratified sample through the sampleBy function
# using column, fractions and seed as arguments
print(data_frame.sampleBy(data_frame.fees, {18000: 0.4, 15000: 0.6}, 0).collect())


Output:

[Output screenshot]

PySpark RDD Sample

1. Using sample function:

Here we use the sample function to get a random sample of a PySpark RDD.

Syntax: sample(withReplacement, fraction, seed=None)

Stepwise Implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Next, convert the data frame to an RDD of Rows to perform the sampling operations on it.

data_frame_rdd = data_frame.rdd

Step 5: Finally, extract the random sample of the RDD using the sample function with withReplacement, fraction, and seed as arguments.

data_frame_rdd.sample(withReplacement, fraction, seed=None)
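Note that data_frame.rdd yields an RDD of Row objects, and that, unlike DataFrame.sample, RDD.sample expects withReplacement as its first positional argument. Here is a minimal sketch on a toy DataFrame built in-line (the column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy two-column DataFrame
toy_df = spark.createDataFrame([(i, 'name_' + str(i)) for i in range(10)],
                               ['id', 'name'])

# Convert to an RDD of Row objects
toy_rdd = toy_df.rdd

# sample is a transformation, so collect() is needed to see the rows
print(toy_rdd.sample(False, 0.3, 1).collect())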

Example:

In this example, we extract a sample from the RDD (built from the 5×5 dataset) through the sample function, passing a fraction and withReplacement as arguments. We extract the sample twice: once with withReplacement set to True and once with it set to False. We observe that with False no row is repeated, while with True some rows appear more than once.

Python3

# Python program to extract a PySpark random sample through the RDD
# sample function with withReplacement and fraction as arguments

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Convert the data frame to an RDD of Rows
data_frame_rdd = data_frame.rdd

# Extract a random sample through the sample function using
# withReplacement (value=True) and fraction as arguments
print(data_frame_rdd.sample(True, 0.2).collect())

# Extract a random sample again using
# withReplacement (value=False) and fraction as arguments
print(data_frame_rdd.sample(False, 0.2).collect())


Output:

When we run the sample command for the first time, we get the following output:

[Output screenshot]

When we run the sample command for the second time, we get the following output:

[Output screenshot]

2. Using takeSample function

Here we use the takeSample function to get a random sample of a PySpark RDD.

Syntax: takeSample(withReplacement, num, seed=None)
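Unlike sample, takeSample is an action: it returns a plain Python list of exactly num elements to the driver, so it should only be used when the requested sample fits in driver memory. A minimal sketch on a toy RDD built with parallelize:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100))

# Exactly 5 elements, without replacement, reproducible via the seed
print(rdd.takeSample(False, 5, 7))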

Stepwise Implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Next, convert the data frame to an RDD of Rows to perform the sampling operations on it.

data_frame_rdd = data_frame.rdd

Step 5: Finally, extract the random sample of the RDD using the takeSample function with withReplacement, num, and seed as arguments.

data_frame_rdd.takeSample(withReplacement, num, seed=None)

Example:

In this example, we extract a sample from the RDD (built from the 5×5 dataset) through the takeSample function, passing withReplacement, num, and seed as arguments. We call it twice: once with withReplacement set to True and once with it set to False. We observe that with False no row is repeated, while with True some rows may appear more than once.

Python3

# Python program to extract a PySpark random sample through the
# takeSample function with withReplacement, num and seed as arguments

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Convert the data frame to an RDD of Rows
data_frame_rdd = data_frame.rdd

# Extract a random sample through the takeSample function using
# withReplacement (value=True), num and seed as arguments
print(data_frame_rdd.takeSample(True, 2, 2))

# Extract a random sample again using
# withReplacement (value=False), num and seed as arguments
print(data_frame_rdd.takeSample(False, 2, 2))


Output:

When we run the sample command for the first time, we get the following output:

[Output screenshot]

When we run the sample command for the second time, we get the following output:

[Output screenshot]


