
How to Get the Number of Elements in Pyspark Partition

Last Updated : 28 Dec, 2022

In this article, we are going to learn how to get the number of elements in a partition using Pyspark in Python.

Are you a data enthusiast who has ever worked with a Pyspark data frame? Then you probably know that whenever we load a file into Pyspark, it splits the data into a number of partitions equal, by default, to the number of cores. You can repartition that data and divide it into as many partitions as you wish. After partitioning, if you want to know how many elements exist in each partition of an RDD or data frame, you can find out using functions from the Pyspark module. In this article, we will discuss exactly that, starting with a quick sketch of partitioning itself.
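You can inspect how many partitions Pyspark created for a data frame through its underlying RDD, and change that number with repartition(). A minimal sketch, using the same placeholder-path convention as the rest of this article:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

# Read any CSV file (the path below is a placeholder)
data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)

# Number of partitions Pyspark created by default
print(data_frame.rdd.getNumPartitions())

# Redistribute the rows across 4 partitions and check again
print(data_frame.repartition(4).rdd.getNumPartitions())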

Prerequisite

Note: In the article about installing Pyspark, we have to install Python instead of Scala; the rest of the steps are the same.

Modules Required:

Pyspark: Pyspark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. You can install the module with the following command:

pip install pyspark

Methods to get the number of elements in a partition:

  • Using the spark_partition_id() function
  • Using the glom() and map() functions

Method 1: Using the spark_partition_id() function

In this method, we are going to use the spark_partition_id() function to get the number of elements in each partition of a data frame.

Stepwise Implementation:

Step 1: First of all, import the required libraries, i.e. SparkSession and spark_partition_id. The SparkSession library is used to create the session, while spark_partition_id is used to get the record count per partition.

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file for which you want to check the number of elements in the partition.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)

Step 4: Finally, get the number of elements of partition using the spark_partition_id function.

data_frame.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show()

Example 1:

In this example, we have read the CSV file (link) and obtained the number of partitions as well as the number of elements per partition using the spark_partition_id function.

Python3




# Python program to get number of elements in partition
  
# Import the SparkSession, spark_partition_id libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Read the CSV file
# here csv file is stored in the same folder
data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True, header=True)
  
# Get number of elements of partition using spark_partition_id function
data_frame.withColumn("partitionId", spark_partition_id()
                      ).groupBy("partitionId").count().show()


Output:

(Screenshot: the show() output listing partitionId and count for each partition.)
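Note that groupBy() does not guarantee any particular row order in its output, so the partition ids may appear unsorted. If you want them listed in order, you can add an orderBy() call; a minimal variation of the snippet above:

data_frame.withColumn("partitionId", spark_partition_id()) \
          .groupBy("partitionId").count() \
          .orderBy("partitionId").show()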

Example 2:

In this example, we have read the CSV file (link) and obtained the number of partitions as well as the number of elements per partition using the spark_partition_id function. Further, we have repartitioned that data and again obtained the number of partitions as well as the record count per partition of the newly partitioned data.

Python3




# Python program to get number of elements in partition
  
# Import the SparkSession, spark_partition_id libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Read the CSV file
data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True, header=True)
  
# Get number of elements in partition using spark_partition_id function
data_frame.withColumn("partitionId", spark_partition_id()
                      ).groupBy("partitionId").count().show()
  
# Select a few columns and redistribute
# the rows across 4 partitions
data_frame_partition = data_frame.select(
    data_frame.longitude, data_frame.latitude,
    data_frame.housing_median_age,
    data_frame.total_rooms).repartition(4)
  
# Get number of elements in partition again using spark_partition_id function
data_frame_partition.withColumn("partitionId",
   spark_partition_id()).groupBy("partitionId").count().show()


Output:

When we get the number of elements per partition before repartitioning, we get the following output:

(Screenshot: the show() output listing partitionId and count for the original partitions.)

When we get the number of elements per partition after repartitioning, we get the following output:

(Screenshot: the show() output listing partitionId and count for the 4 new partitions.)
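As an aside, repartition(4) above simply redistributes the rows evenly across 4 partitions; it does not partition by any column values. If you want to hash-partition by specific columns, repartition() also accepts column arguments. A minimal sketch, assuming the data_frame from the example above:

# Hash-partition the rows into 4 partitions by the
# longitude and latitude column values
data_frame_by_cols = data_frame.repartition(4, "longitude", "latitude")

data_frame_by_cols.withColumn("partitionId", spark_partition_id()) \
                  .groupBy("partitionId").count().show()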

Method 2: Using the glom() and map() functions

In this method, we are going to use the map() function together with the glom() function to get the number of elements in each partition of an RDD (for a data frame, we use its underlying RDD).

Stepwise Implementation:

Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Later on, get the Spark Context from the session.

sc = spark_session.sparkContext

Step 4: Then, either read the CSV file whose partitions you want to inspect, or create an RDD from your own dataset with the number of partitions you want.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)

or

num_partitions = Declare_number_of_partitions_to_be_done
data_frame = sc.parallelize(Declare_the_dataset, num_partitions)

Step 5: Further, get the length of each partition using the glom() and map() functions, and use collect() to retrieve the result. Note that glom() is defined on RDDs; if you read your data into a data frame, first convert it with data_frame.rdd.

l = data_frame.glom().map(len).collect()

or

l = data_frame.rdd.glom().map(len).collect()  # if data_frame is a data frame

Step 6: Finally, print the length of each partition obtained in the previous step.

print(l)
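As a quick sanity check, the per-partition lengths should add up to the total number of elements. A one-line verification, assuming l was computed as in the previous step:

# The per-partition counts must sum to the total element count
assert sum(l) == data_frame.count()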

Example:

In this example, we have declared a dataset and the number of partitions to be made of it. Then, we applied the glom() and map() functions to the dataset and obtained the number of elements in each partition.

Python3




# Python program to get number of elements in partition
  
# Import the SparkSession library
from pyspark.sql import SparkSession
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Create a SparkContext
sc = spark_session.sparkContext
  
# Declare the number of partitions
# to split the data set into
num_partitions = 10

# Create an RDD from the data set
# with the desired number of partitions
data_frame = sc.parallelize(range(100),
                            num_partitions)
  
# Get the number of elements in each partition
l = data_frame.glom().map(len).collect()

# Print the per-partition element counts
print(l)


Output:

Since the 100 elements are split evenly across 10 partitions, each partition holds 10 elements:

[10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
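The same glom()-based count also works on a data frame read from a CSV file, once you drop down to its underlying RDD. A minimal sketch, assuming the same california_housing_train.csv file used in Method 1:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file into a data frame
data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True, header=True)

# A data frame has no glom(); convert to its underlying RDD first
print(data_frame.rdd.glom().map(len).collect())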


