Show partitions on a PySpark RDD
Have you ever needed to show the partitions of a PySpark RDD for a data frame you loaded, or to repartition data and check that it was partitioned correctly, but did not know how? You can do this with the getNumPartitions function of a PySpark RDD. Read on, where we discuss how.
Show partitions on a PySpark RDD in Python
PySpark is the Python API for Apache Spark, an open-source distributed computing framework and set of libraries for real-time, large-scale data processing. The module can be installed with the following command:

pip install pyspark
To get the number of partitions of a PySpark data frame, first convert the data frame to an RDD. Then show its partitions with:
data_frame_rdd.getNumPartitions()
First, import the required library, SparkSession, which is used to create the session. Create a Spark session using the getOrCreate function. Then read the CSV file and display it to check that it was loaded correctly. Next, convert the data frame to an RDD. Finally, get the number of partitions using the getNumPartitions function.
Example 1:
In this example, we read the CSV file and show the partitions of the resulting PySpark RDD using the getNumPartitions function.
Python3
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# Read the CSV file into a data frame
data_frame = spark.read.csv('california_housing_train.csv',
                            sep=',', inferSchema=True,
                            header=True)

# Display the data frame to check that it loaded correctly
data_frame.show()

# Convert the data frame to an RDD
data_frame_rdd = data_frame.rdd

# Show the number of partitions
print(data_frame_rdd.getNumPartitions())
Output:
Example 2:
In this example, we read the CSV file and show the partitions of the resulting PySpark RDD using the getNumPartitions function. We then repartition the data and show the partitions of the new, repartitioned data.
Python3
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# Read the CSV file into a data frame
data_frame_1 = spark.read.csv('california_housing_train.csv',
                              sep=',', inferSchema=True,
                              header=True)

# Display the data frame to check that it loaded correctly
data_frame_1.show()

# Convert the data frame to an RDD and show its partition count
data_frame_rdd_1 = data_frame_1.rdd
print(data_frame_rdd_1.getNumPartitions())

# Select a few columns and repartition the data into 4 partitions
data_frame_2 = data_frame_1.select(data_frame_1.longitude,
                                   data_frame_1.latitude,
                                   data_frame_1.housing_median_age,
                                   data_frame_1.total_rooms).repartition(4)

# Convert the repartitioned data frame to an RDD and show its partition count
data_frame_rdd_2 = data_frame_2.rdd
print(data_frame_rdd_2.getNumPartitions())
Output:
Last Updated: 19 Dec, 2022