
Get current number of partitions of a DataFrame – Pyspark

In this article, we are going to learn how to get the current number of partitions of a data frame using Pyspark in Python.

In many cases, we need to know the number of partitions of a large data frame. Sometimes we have partitioned the data and need to verify that it has been partitioned correctly. There are several methods to get the current number of partitions of a data frame using Pyspark in Python.



Prerequisite

Note: When following the article about installing Pyspark, install Python instead of Scala; the rest of the steps are the same.

Modules Required

Pyspark: Pyspark is the Python API for Apache Spark. It lets Python programs use Spark's distributed processing and offers data-handling features familiar from Python libraries such as Pandas and Scikit-learn. This module can be installed through the following command in Python:



pip install pyspark

Methods to get the current number of partitions of a DataFrame

Method 1: Using getNumPartitions() function

In this method, we find the number of partitions of a data frame using the getNumPartitions() function of its underlying RDD.

Syntax: rdd.getNumPartitions()

  • Return type: This function returns the number of partitions as an integer, as the short sketch below shows.
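
getNumPartitions() is an RDD method, which is why it is called on a data frame's rdd attribute in the steps below. A minimal sketch of the call itself, assuming a SparkSession named spark_session has already been created (as in Step 2 below):

# getNumPartitions() is called directly on an RDD
rdd = spark_session.sparkContext.parallelize(range(100), 4)
print(rdd.getNumPartitions())  # 4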

Stepwise Implementation:

Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file whose number of partitions you want to know.

data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)

Step 4: Finally, get the number of partitions using the getNumPartitions function.

print(data_frame.rdd.getNumPartitions())

Example:

In this example, we have read the CSV file class_data.csv and obtained the current number of partitions using the getNumPartitions function.

# Python program to get current number of
# partitions using getNumPartitions function
  
# Import the SparkSession library
from pyspark.sql import SparkSession
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Read the CSV file
data_frame = spark_session.read.csv(
  '/content/class_data.csv',
  sep=',', inferSchema=True, header=True)
  
# Get current number of partitions
# using getNumPartitions function
print(data_frame.rdd.getNumPartitions())

Output:

1
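
Since getNumPartitions() always reflects the current partitioning, the same call can also be used to verify the result of a repartition() call. A short sketch, reusing the data_frame from the example above:

# Repartition the data frame into 4 partitions
# and verify the new partition count
repartitioned = data_frame.repartition(4)
print(repartitioned.rdd.getNumPartitions())  # 4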

Method 2: Using spark_partition_id() function

In this method, we are going to find the number of partitions using the spark_partition_id() function, which returns the id of the partition each row of the data frame belongs to. By counting the distinct partition ids, we can obtain the number of partitions, as implemented below.

Stepwise Implementation:

Step 1: First of all, import the required libraries, i.e. SparkSession, spark_partition_id, and countDistinct. The SparkSession library is used to create the session, spark_partition_id returns the id of the partition a row belongs to, and countDistinct is used to count the distinct values of the selected column(s).

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, countDistinct

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file whose number of partitions you want to know.

data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)

Step 4: Finally, get the current number of partitions using the spark_partition_id and countDistinct functions. Note that the aggregation produces a single-row data frame, so the distinct count has to be read out of that row; calling .count() on the aggregated data frame would always return 1, whatever the actual number of partitions.

print(data_frame.withColumn("partitionid", spark_partition_id())
                .select("partitionid")
                .agg(countDistinct("partitionid"))
                .collect()[0][0])

Example:

In this example, we have read the same CSV file as in the first method and obtained the current number of partitions using the spark_partition_id and countDistinct() functions.




# Python program to get current number of 
# partitions using spark_partition_id function
  
# Import the SparkSession, spark_partition_id 
# and countDistinct libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, countDistinct
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Read the CSV file
data_frame = spark_session.read.csv(
  '/content/class_data.csv',
  sep=',', inferSchema=True, header=True)
  
# Get current number of partitions using the
# spark_partition_id function; read the distinct
# count out of the single aggregated row
print(data_frame.withColumn(
  "partitionid", spark_partition_id()).select(
  "partitionid").agg(countDistinct("partitionid")).collect()[0][0])

Output:

1
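
An equivalent, slightly shorter form drops countDistinct and counts the distinct partition ids directly. A minimal sketch, reusing the data_frame from the example above (like the countDistinct version, this counts only partitions that contain at least one row):

# Count the distinct partition ids directly
print(data_frame.select(spark_partition_id().alias("partitionid"))
                .distinct().count())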

Method 3: Using map() function

In this method, we use the glom() and map() functions to find the current number of partitions of a data frame: glom() turns each partition into a list of its rows, and map(len) then gives the length (row count) of each partition, so the number of collected lengths equals the number of partitions.

Stepwise Implementation:

Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Later on, obtain the SparkContext from the session.

sc = spark_session.sparkContext

Step 4: Then, either read the CSV file whose number of partitions you want to know, or create an RDD from a dataset with the number of partitions you want it to have.

data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)

or 

num_partitions = ...  # Declare the number of partitions to create
data_frame = sc.parallelize(..., num_partitions)  # Declare the dataset

Step 5: Further, get the length of each partition using the glom() and map() functions. (parallelize returns an RDD, on which glom() is available directly; if you read a CSV file in Step 4 instead, convert the data frame to an RDD first via data_frame.rdd.)

l = data_frame.glom().map(len).collect()

Step 6: Finally, obtain the current number of partitions using the length function on the list obtained in the previous step.

print(len(l))

Example:

In this example, we have declared a dataset along with the number of partitions to create from it. Then we applied the glom and map functions to the data set and checked that we got back the number of partitions we requested.




# Python program to get the length of each partition
# of a data frame using the glom and map functions
  
# Import the SparkSession library
from pyspark.sql import SparkSession
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Create a SparkContext
sc = spark_session.sparkContext
  
# Declare the number of partitions
# we want to do of data set
num_partitions = 10
  
# Declare the data set along 
# with the number of partitions
data_frame = sc.parallelize(range(100),
                            num_partitions)
  
# Get the length of each partition of the
# data frame using the glom and map functions
l = data_frame.glom().map(len).collect()
  
# Get current number of partitions using len function
print(len(l))

Output:

10
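
The same per-partition inspection also works for a data frame read from a CSV file, as in Methods 1 and 2; since glom() is an RDD method, convert the data frame first via its rdd attribute. A minimal sketch, assuming data_frame was created with spark_session.read.csv:

# Convert the data frame to an RDD, then inspect its partitions
partition_lengths = data_frame.rdd.glom().map(len).collect()
print(partition_lengths)       # number of rows in each partition
print(len(partition_lengths))  # current number of partitions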
