PySpark sampleBy using multiple columns
Last Updated :
03 Jan, 2023
In this article, we are going to learn about PySpark sampleBy using multiple columns in Python.
When processing big data, we often need only a sample rather than the full dataset. In PySpark, the sampleBy() function returns such a sample, stratified by a column. In this article, we learn how to sample on multiple columns by using keyBy() together with sampleByKey().
sampleBy() function:
sampleBy() returns a stratified sample, without replacement, according to the fraction specified for each stratum. The strata are defined by the values of a column, so sampling is performed per column value.
Syntax: DataFrame.sampleBy(col, fractions, seed=None)
Parameters:
- col: the column, or column name, that defines the strata.
- fractions: a dictionary mapping each stratum to its sampling fraction (between 0 and 1); any stratum not in the dictionary is treated as having fraction 0.
- seed: random seed (optional).
Returns: A new DataFrame that represents the stratified sample.
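Conceptually, sampleBy() keeps each row with the probability registered for that row's stratum. The idea can be sketched in plain Python (a simplified illustration of the semantics, not Spark's implementation; the data and names here are made up):

```python
import random

def sample_by(rows, key_index, fractions, seed=None):
    """Keep each row with the fraction registered for its stratum.

    Strata missing from `fractions` default to 0, as in PySpark.
    """
    rng = random.Random(seed)
    return [row for row in rows
            if rng.random() < fractions.get(row[key_index], 0.0)]

rows = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "B")]
sampled = sample_by(rows, key_index=1,
                    fractions={"A": 0.5, "B": 0.2}, seed=42)
```

Each row of stratum "A" survives with probability 0.5 and each row of "B" with probability 0.2; which rows are actually returned depends on the seed.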
Steps of PySpark sampleBy using multiple columns
Step 1: First of all, import the SparkSession library, which is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a Spark session using the getOrCreate() function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, either create the data frame using the createDataFrame() function or read the CSV file.
data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)
or
data_frame = spark_session.createDataFrame([(column_data_1), (column_data_2), (column_data_3)],
                                           ['column_name_1', 'column_name_2', 'column_name_3'])
Step 4: Later on, store the data frame in another variable as it will be used during sampling.
df=data_frame
Step 5: Further, build the fractions dictionary: map every row to the tuple of sampling-key columns, take the distinct tuples, and pair each one with the sampling fraction using collectAsMap().
fractions = df.rdd.map(lambda x: (x[column_index_1],
                                  x[column_index_2])).distinct().map(
    lambda x: (x, fraction)).collectAsMap()
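The map/distinct/collectAsMap() chain above simply builds a dictionary that maps every distinct pair of key-column values to the same fraction. In plain Python (with made-up rows standing in for the data frame), the result looks like this:

```python
# Hypothetical rows of the data frame: (Roll_Number, Fees, Fine)
rows = [(1, 10000, 400), (2, 14000, 500), (3, 12000, 800)]
fraction = 0.4

# Equivalent of df.rdd.map(...).distinct().map(...).collectAsMap():
# every distinct (Roll_Number, Fees) pair maps to the same fraction.
fractions = {(row[0], row[1]): fraction for row in rows}

print(fractions)
# {(1, 10000): 0.4, (2, 14000): 0.4, (3, 12000): 0.4}
```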
Step 6: Moreover, pair each row with its multi-column key using the keyBy() function, which produces (key, row) tuples.
key_df = df.rdd.keyBy(lambda x: (x[column_index_1],x[column_index_2]))
Step 7: Finally, extract the random sample through the sampleByKey() function, passing the withReplacement boolean, the fractions dictionary, and an optional seed as arguments, and display the resulting data frame.
# x[1] recovers the original row from each (key, row) pair
key_df.sampleByKey(False, fractions).map(
    lambda x: x[1]).toDF(data_frame.columns).show()
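Putting steps 5-7 together without Spark: keyBy() turns each row into a (key, row) pair, and sampleByKey() keeps each pair with the fraction stored for its key. A plain-Python sketch of that flow (hypothetical data, simplified sampling logic):

```python
import random

rows = [(1, 10000, 400), (2, 14000, 500), (3, 12000, 800)]
fractions = {(r[0], r[1]): 0.4 for r in rows}          # step 5

# Step 6: keyBy -- pair each row with its multi-column key.
keyed = [((r[0], r[1]), r) for r in rows]

# Step 7: like sampleByKey(False, fractions, seed) -- keep each
# (key, row) pair with the fraction registered for its key.
rng = random.Random(4)
sampled = [row for key, row in keyed if rng.random() < fractions[key]]
```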
Example 1:
In this example, we created a data frame with columns 'Roll_Number', 'Fees' and 'Fine', and then sampled it through the sampleByKey() function, passing the withReplacement boolean, a key built from multiple columns ('Roll_Number' and 'Fees'), and the fraction as arguments. We drew the sample twice to check whether we get the same rows each time; since no seed is supplied, the two samples can differ between runs.
Python3
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

# Create the data frame with three columns
data_frame = spark_session.createDataFrame([(1, 10000, 400),
                                            (2, 14000, 500),
                                            (3, 12000, 800)],
                                           ['Roll_Number', 'Fees', 'Fine'])
df = data_frame
print("Data frame:")
df.show()

# Map every distinct (Roll_Number, Fees) pair to the fraction 0.4
fractions = df.rdd.map(lambda x: (x[0], x[1])).distinct().map(
    lambda x: (x, 0.4)).collectAsMap()

# Pair each row with its (Roll_Number, Fees) key
key_df = df.rdd.keyBy(lambda x: (x[0], x[1]))

print("Sample 1: ")
key_df.sampleByKey(False, fractions).map(
    lambda x: x[1]).toDF(data_frame.columns).show()
print("Sample 2: ")
key_df.sampleByKey(False, fractions).map(
    lambda x: x[1]).toDF(data_frame.columns).show()
Output:
Data frame:
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
| 1|10000| 400|
| 2|14000| 500|
| 3|12000| 800|
+-----------+-----+----+
Sample 1:
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
| 3|12000| 800|
+-----------+-----+----+
Sample 2:
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
| 3|12000| 800|
+-----------+-----+----+
Example 2:
In this example, we read the data frame from a CSV file and then sampled it through the sampleByKey() function, passing the withReplacement boolean, a key built from multiple columns ('class', 'fees' and 'discount'), the fraction, and a seed as arguments. We drew the sample twice to check whether we get the same rows each time; because the seed is fixed, both calls return the same rows.
Python3
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

# Read the data frame from a CSV file
data_frame = spark_session.read.csv(
    '/content/drive/MyDrive/Colab Notebooks/class_data.csv',
    sep=',', inferSchema=True, header=True)
df = data_frame
print("Data frame: ")
df.show()

# Map every distinct (class, fees, discount) triple to the fraction 0.4
fractions = df.rdd.map(lambda x: (x[2], x[3], x[4])).distinct().map(
    lambda x: (x, 0.4)).collectAsMap()

# Pair each row with its (class, fees, discount) key
key_df = df.rdd.keyBy(lambda x: (x[2], x[3], x[4]))

print("Sample 1: ")
key_df.sampleByKey(True, fractions, 4).map(
    lambda x: x[1]).toDF(data_frame.columns).show()
print("Sample 2: ")
key_df.sampleByKey(True, fractions, 4).map(
    lambda x: x[1]).toDF(data_frame.columns).show()
Output:
Data frame:
+-------+--------------+-----+-----+--------+
| name| subject|class| fees|discount|
+-------+--------------+-----+-----+--------+
| Arun| Maths| 10|12000| 400|
| Aniket|Social Science| 11|15000| 600|
| Ishita| English| 9| 9000| 0|
|Pranjal| Science| 12|18000| 1000|
|Vinayak| Computer| 12|18000| 500|
+-------+--------------+-----+-----+--------+
Sample 1:
+------+-------+-----+----+--------+
| name|subject|class|fees|discount|
+------+-------+-----+----+--------+
|Ishita|English| 9|9000| 0|
+------+-------+-----+----+--------+
Sample 2:
+------+-------+-----+----+--------+
| name|subject|class|fees|discount|
+------+-------+-----+----+--------+
|Ishita|English| 9|9000| 0|
+------+-------+-----+----+--------+
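The difference between the two examples comes down to the seed argument: Example 1 omits it, so each sampleByKey() call draws fresh random numbers, while Example 2 fixes seed=4, so both calls return identical samples. The effect mirrors seeding Python's own random generator (an analogy, not Spark's internals):

```python
import random

data = list(range(20))

def take_sample(seed=None):
    """Keep each element with probability 0.4, like a 0.4 fraction."""
    rng = random.Random(seed)
    return [x for x in data if rng.random() < 0.4]

# With a fixed seed, repeated calls reproduce the same sample,
# just as seed=4 did in Example 2.
assert take_sample(seed=4) == take_sample(seed=4)
```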