Show partitions on a Pyspark RDD


Have you ever needed to show the partitions of a PySpark RDD for a DataFrame you uploaded, or to partition the data and check whether it has been partitioned correctly? Not sure how to achieve this? You can do it using the getNumPartitions function of a PySpark RDD. Want to know more? Read on, and we will discuss exactly that.

Show partitions on a Pyspark RDD in Python

PySpark is an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing, built as a Python API for Apache Spark. The module can be installed with the following command:
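pip install pyspark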

To get the number of partitions of a PySpark RDD, first convert the DataFrame to an RDD. Then, to show its partitions, use:

data_frame_rdd.getNumPartitions()

First of all, import the required library, i.e., SparkSession. The SparkSession library is used to create the session. Now, create a Spark session using the getOrCreate function. Then, read the CSV file and display it to check that it has been uploaded correctly. Next, convert the DataFrame to an RDD using its rdd attribute. Finally, get the number of partitions using the getNumPartitions function.
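Before working through the CSV example below, here is a minimal, self-contained sketch of the same idea: sc.parallelize accepts an optional numSlices argument, so you can create an RDD with a known number of partitions and confirm it with getNumPartitions (the numbers here are just for illustration):

# Create a spark session and get its SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Build an RDD split into 5 partitions and verify the count
rdd = sc.parallelize(range(100), numSlices=5)
print(rdd.getNumPartitions())  # 5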

Example 1:

In this example, we have read the CSV file (link) and shown partitions on Pyspark RDD using the getNumPartitions function.

Python3
# Python program to show partitions on a PySpark RDD

# Import the SparkSession library
from pyspark.sql import SparkSession
 
# Create a spark session using getOrCreate() function
spark = SparkSession.builder.getOrCreate()
 
# Read the CSV file
data_frame = spark.read.csv('california_housing_train.csv',
                            sep=',', inferSchema=True,
                            header=True)
 
# Display the csv file read
data_frame.show()
 
# Convert the dataframe to an RDD
data_frame_rdd = data_frame.rdd
 
# Show partitions on pyspark RDD using
# getNumPartitions function
print(data_frame_rdd.getNumPartitions())


Output:

 

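getNumPartitions only reports how many partitions there are. If you also want to inspect what each partition actually contains, the RDD API provides glom(), which gathers the elements of each partition into a list. A short sketch, reusing data_frame_rdd from the example above (note that collect() brings all rows to the driver, so only do this on small data):

# Gather each partition's rows into a list, one list per partition
partition_contents = data_frame_rdd.glom().collect()
for i, rows in enumerate(partition_contents):
    print(f"Partition {i}: {len(rows)} rows")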
Example 2:

In this example, we have read the CSV file (link) and shown the partitions of the PySpark RDD using the getNumPartitions function. We have then repartitioned the data and shown the partition count of the newly partitioned data.

Python3
# Python program to show partitions on a PySpark RDD

# Import the SparkSession library
from pyspark.sql import SparkSession
 
# Create a spark session using getOrCreate() function
spark = SparkSession.builder.getOrCreate()
 
# Read the CSV file
data_frame_1 = spark.read.csv('california_housing_train.csv',
                              sep=',', inferSchema=True,
                              header=True)
 
# Display the csv file read
data_frame_1.show()
 
# Convert the dataframe to an RDD
data_frame_rdd_1 = data_frame_1.rdd
 
# Show partitions on pyspark RDD
# using getNumPartitions function
print(data_frame_rdd_1.getNumPartitions())
 
# Select the longitude, latitude, housing_median_age, and
# total_rooms columns, then repartition the data into 4 partitions
data_frame_2 = data_frame_1.select(data_frame_1.longitude,
                                   data_frame_1.latitude,
                                   data_frame_1.housing_median_age,
                                   data_frame_1.total_rooms).repartition(4)
 
# Convert the dataframe to an RDD
data_frame_rdd_2 = data_frame_2.rdd
 
# Show partitions on pyspark RDD using getNumPartitions function
print(data_frame_rdd_2.getNumPartitions())


Output:

 
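A note on the design choice above: repartition(4) triggers a full shuffle to spread rows evenly across the new partitions. If you only need to reduce the number of partitions, coalesce() is usually cheaper because it merges existing partitions without a full shuffle. A minimal sketch, continuing from data_frame_2 above:

# Merge the 4 partitions down to 2 without a full shuffle
data_frame_3 = data_frame_2.coalesce(2)
print(data_frame_3.rdd.getNumPartitions())  # 2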



Last Updated : 19 Dec, 2022