PySpark sampleBy using multiple columns

Last Updated : 03 Jan, 2023

In this article, we are going to learn about PySpark sampleBy using multiple columns in Python.

While processing big data, there are many cases where we need only a sample of the data. In PySpark, we can draw such a sample with the sampleBy() function. In this article, we are going to learn how to take stratified samples over multiple columns through the sampleBy() function.

sampleBy() function:

sampleBy() is the function that returns a stratified sample without replacement, based on the fraction given for each stratum. A column defines the strata, and each stratum is sampled at its own rate.

Syntax: DataFrame.sampleBy(col, fractions, seed=None)

Parameters:

  • col: the column (or column name) that defines the strata.
  • fractions: a dictionary mapping each stratum to its sampling fraction, between 0 and 1; any stratum not listed is treated as having fraction 0.
  • seed: random seed (optional).

Returns: A new DataFrame that represents the stratified sample.
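
Before moving to multiple columns, here is a minimal sketch of sampleBy() on a single column; the data, column name and fractions below are illustrative, not taken from the examples that follow.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'A'), (2, 'A'), (3, 'B'), (4, 'B')],
                           ['id', 'group'])

# Keep roughly 50% of stratum 'A' and all of stratum 'B'
df.sampleBy('group', fractions={'A': 0.5, 'B': 1.0}, seed=0).show()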

Steps of PySpark sampleBy using multiple columns

Step 1: First of all, import the SparkSession library. The SparkSession library is used to create the session. 

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using getOrCreate() function. 

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, either create the data frame using the createDataFrame() function or read the CSV file. 

data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)

or 

data_frame = spark_session.createDataFrame([(column_data_1), (column_data_2), (column_data_3)],
                                           ['column_name_1', 'column_name_2', 'column_name_3'])
Step 4: Later on, store the data frame in another variable as it will be used during sampling. 

df = data_frame

Step 5: Further, build the fractions dictionary: map every row to a tuple of the chosen column values, deduplicate the tuples with distinct(), pair each distinct tuple with the sampling fraction, and collect the result as a map.

fractions = df.rdd.map(lambda x: (x[column_index_1], x[column_index_2])) \
                  .distinct() \
                  .map(lambda x: (x, fraction)).collectAsMap()

Step 6: Moreover, key every row by the same tuple of column values using the keyBy() function, which yields (key, row) pairs.

key_df = df.rdd.keyBy(lambda x: (x[column_index_1], x[column_index_2]))

Step 7: Finally, extract the random sample through the sampleByKey() function, passing the with-replacement boolean, the fractions dictionary, and optionally a seed as arguments; then map each (key, row) pair back to its row and display the resulting data frame.

key_df.sampleByKey(False, fractions).map(lambda x: x[1]).toDF(data_frame.columns).show()

Example 1:

In this example, we have created a data frame with the columns ‘Roll_Number’, ‘Fees’ and ‘Fine’, and then extracted a sample from it through the sampleByKey() function, passing the with-replacement boolean (False), keys built from multiple columns (‘Roll_Number’ and ‘Fees’) and the fractions dictionary as arguments. We extracted the random sample twice through the sampleByKey() function to see whether we get the same rows each time. Since no seed was supplied, the two calls are not guaranteed to agree, and across runs we observed different values each time.

Python3

# Pyspark program to sampleBy using multiple columns
 
# Import the libraries SparkSession library
from pyspark.sql import SparkSession
 
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
 
# Create a data frame with three columns 'Roll_Number,' 'Fees' and 'Fine'
data_frame = spark_session.createDataFrame([(1, 10000, 400),
                                            (2, 14000, 500),
                                            (3, 12000, 800)],
                                           ['Roll_Number', 'Fees', 'Fine'])
 
# Store the data frame in another variable
# as it will be used during sampling
df = data_frame
print("Data frame:")
df.show()
 
# Apply transformation on every element by defining the columns (first, second)
# as well as sampling percentage as an argument in the map function
fractions = df.rdd.map(lambda x: (x[0], x[1])).distinct().map(
    lambda x: (x, 0.4)).collectAsMap()
 
# Create tuple of elements using keyBy function
key_df = df.rdd.keyBy(lambda x: (x[0], x[1]))
 
# Extract random sample through sampleByKey function
# using boolean, columns (first and second) and fraction as arguments
print("Sample 1: ")
key_df.sampleByKey(False, fractions).map(
    lambda x: x[1]).toDF(data_frame.columns).show()
 
# Again extract random sample through sampleByKey function
# using boolean, columns (first and second) and fraction as arguments
print("Sample 2: ")
key_df.sampleByKey(False, fractions).map(
    lambda x: x[1]).toDF(data_frame.columns).show()


Output:

Data frame:
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
|          1|10000| 400|
|          2|14000| 500|
|          3|12000| 800|
+-----------+-----+----+

Sample 1: 
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
|          3|12000| 800|
+-----------+-----+----+

Sample 2: 
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
|          3|12000| 800|
+-----------+-----+----+
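
Note that sampleByKey() also takes an optional seed argument; if a reproducible sample is wanted in this example, a minimal sketch (the seed value 7 is illustrative) would be:

key_df.sampleByKey(False, fractions, 7).map(lambda x: x[1]).toDF(data_frame.columns).show()

With a fixed seed, repeated calls return the same rows.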

Example 2:

In this example, we have read the data frame from the CSV file (link) and then extracted a sample from it through the sampleByKey() function, passing the with-replacement boolean (True), keys built from multiple columns (‘Class’, ‘Fees’ and ‘Discount’), the fractions dictionary and a seed as arguments. We extracted the random sample twice through the sampleByKey() function to see whether we get the same rows each time. Because the same seed (4) is passed to both calls, we got the same values each time.

Python3

# Pyspark program to sampleBy using multiple columns
 
# Import the libraries SparkSession library
from pyspark.sql import SparkSession
 
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
 
# Read the CSV file
data_frame = spark_session.read.csv(
    '/content/drive/MyDrive/Colab Notebooks/class_data.csv',
    sep=',', inferSchema=True, header=True)
 
# Store the data frame in another variable
# as it will be used during sampling
df = data_frame
print("Data frame: ")
df.show()
 
# Apply transformation on every element by defining the columns
# (third, fourth and fifth) as well as sampling percentage
# as an argument in the map function
fractions = df.rdd.map(lambda x:
    (x[2], x[3], x[4])).distinct().map(
    lambda x: (x, 0.4)).collectAsMap()
 
# Create tuple of elements using keyBy function
key_df = df.rdd.keyBy(lambda x:
                      (x[2], x[3], x[4]))
 
# Extract random sample through sampleByKey function using boolean,
# columns (third, fourth and fifth), fraction and seed (value=4) as arguments
print("Sample 1: ")
key_df.sampleByKey(True, fractions, 4).map(
    lambda x: x[1]).toDF(data_frame.columns).show()
 
# Again extract random sample through sampleByKey function using boolean,
# columns (third, fourth and fifth), fraction and seed (value=4) as arguments
print("Sample 2: ")
key_df.sampleByKey(True, fractions, 4).map(
    lambda x: x[1]).toDF(data_frame.columns).show()


Output:

Data frame: 
+-------+--------------+-----+-----+--------+
|   name|       subject|class| fees|discount|
+-------+--------------+-----+-----+--------+
|   Arun|         Maths|   10|12000|     400|
| Aniket|Social Science|   11|15000|     600|
| Ishita|       English|    9| 9000|       0|
|Pranjal|       Science|   12|18000|    1000|
|Vinayak|      Computer|   12|18000|     500|
+-------+--------------+-----+-----+--------+

Sample 1: 
+------+-------+-----+----+--------+
|  name|subject|class|fees|discount|
+------+-------+-----+----+--------+
|Ishita|English|    9|9000|       0|
+------+-------+-----+----+--------+

Sample 2: 
+------+-------+-----+----+--------+
|  name|subject|class|fees|discount|
+------+-------+-----+----+--------+
|Ishita|English|    9|9000|       0|
+------+-------+-----+----+--------+
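
An alternative that stays entirely in the DataFrame API is to combine the columns into one composite key and call sampleBy() on it directly. Below is a sketch under that assumption; the helper column name 'strata' is made up, and the column names follow the CSV file above.

from pyspark.sql import functions as F

# Build a composite strata column from the three columns
# (the helper column name 'strata' is made up for this sketch)
df2 = df.withColumn('strata', F.concat_ws('_', 'class', 'fees', 'discount'))

# Sample 40% of every stratum, then drop the helper column
fractions = {row['strata']: 0.4 for row in df2.select('strata').distinct().collect()}
df2.sampleBy('strata', fractions, seed=4).drop('strata').show()

Both routes draw a stratified sample per distinct combination of the three columns; this one avoids converting to an RDD and back.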

