Partitioning by multiple columns in PySpark with columns in a list
PySpark offers numerous functions to operate on a dataset. One particularly useful feature is the Window function, which operates on a group of rows and returns a single value for every input row. Did you know that you can even partition a dataset through the Window function? Partitioning is possible not only by one column, but by several columns at once. In this article, we will discuss exactly that: partitioning by multiple columns in PySpark with the columns supplied in a list.
PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. It can be installed with the following command:
pip install pyspark
Stepwise Implementation of partitioning by multiple columns:
Step 1: First of all, import the required libraries, i.e., SparkSession and Window. The SparkSession library is used to create the session, while the Window function defines the partitions and ordering over which a single value is computed for every input row. You can also import other utilities, such as the functions module (for lag, row_number, etc.), for the operations you want to perform on the dataset after partitioning by multiple columns is done.
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.window import Window
Step 2: Now, create a Spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Later on, declare a list of columns according to which partition has to be done.
column_list = ["#column-1","#column-2"]
Step 5: Next, partition the data by the columns in the list declared in the last step, and order the data within each partition by any column using the Window function.
Windowspec = Window.partitionBy(column_list).orderBy("#column-n")
Step 6: Finally, perform an action on the partitioned dataset, such as adding a row number to each entry or applying a lag to a column, and display the result in a new column.
data_frame.withColumn('Updated Column', func.lag(data_frame['#column-name']).over(Windowspec)).show()
In this example, we have used a data frame (link), i.e., a 5×5 dataset, to which we applied the Window partition-by function through the columns in the list declared earlier, i.e., class and fees, and then sorted it in ascending order of class. Further, we added a row number to each entry according to the partitions and displayed it in a new column, 'row_number'.
In this example, we have used a data frame (link), i.e., a 5×5 dataset, to which we applied the Window partition-by function through the columns in the list declared earlier, i.e., age, class, and fees, and then sorted it in ascending order of age. Further, we added a lag of 1 for each entry of subject and displayed it in a new column, 'Updated Subject'.
In this example, we have created a data frame using list comprehension with the columns 'Serial Number,' 'Brand,' and 'Model,' to which we applied the Window partition-by function through the columns in the list declared earlier, i.e., Brand and Model, and then sorted it in ascending order of Brand. Further, we added a lag of 1 for each entry of Model and displayed it in a new column, 'Updated Model'.