Rename Duplicated Columns after Join in Pyspark dataframe

Last Updated : 23 Jan, 2023

In this article, we are going to learn how to rename duplicated columns after a join in a Pyspark data frame in Python.

A Pyspark data frame is a distributed collection of data grouped into named columns. When handling a lot of data, not all of it comes from a single data frame, so two or more data frames often need to be merged. The merge or join can be inner, outer, left, right, etc., but if some columns are duplicated after the join, we can no longer reference them unambiguously or apply functions to the joined data frame. This article therefore explains how to rename duplicated columns after a join in a Pyspark data frame.
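
To see why this matters, here is a minimal sketch of the problem (the data and column names below are illustrative only, not taken from the examples later in this article). Both data frames contain a column named 'Fine'; the join itself succeeds, but any later reference to 'Fine' by name is ambiguous and Spark raises an AnalysisException.

Python3

# Minimal sketch of the duplicate-column problem (illustrative names only)
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

left_df = spark_session.createDataFrame([(1, 100)], ['Roll_Number', 'Fine'])
right_df = spark_session.createDataFrame([(1, 400)], ['Join_Number', 'Fine'])

# The join works, but the result now has two columns called 'Fine'
joined = left_df.join(right_df, left_df.Roll_Number == right_df.Join_Number, 'inner')

# This line fails with an AnalysisException because the reference 'Fine' is ambiguous
joined.select('Fine').show()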

Steps to rename duplicated columns after a join in a Pyspark data frame:

Step 1: First of all, import the required library, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate() function. 

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, either read the CSV files into two data frames or create the two data frames using the createDataFrame() function.

data_frame = spark_session.createDataFrame(
    [(row_1_data), (row_2_data), (row_3_data)],
    ['column_name_1', 'column_name_2', 'column_name_3'])

or

data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)

Step 4: Further, join the two data frames on a common column. We can use any type of join: left, right, inner, outer, etc. Finally, rename the duplicated column of either data frame using the withColumnRenamed() function, which takes the old column name and the new column name as parameters.

data_frame_1.withColumnRenamed(
    'column_to_be_renamed', 'new_column_name_1').join(
        data_frame_2.withColumnRenamed(
            'column_to_be_renamed', 'new_column_name_2'),
        data_frame_1.column_to_be_joined_1 == data_frame_2.column_to_be_joined_2,
        "inner").show()

Example 1:

In this example, we have created two data frames: the first with the fields 'Roll_Number', 'Name', 'Fine' and 'Department_Number', and the second with the fields 'Fees', 'Fine' and 'Match_Department_Number'.

First Data Frame:

+-----------+-------+----+-----------------+
|Roll_Number|   Name|Fine|Department_Number|
+-----------+-------+----+-----------------+
|          1|   Arun| 100|              112|
|          2| Ishita| 200|              123|
|          3|Vinayak| 400|              112|
+-----------+-------+----+-----------------+

Second Data Frame:

+-----+----+-----------------------+
| Fees|Fine|Match_Department_Number|
+-----+----+-----------------------+
|10000| 400|                    112|
|14000| 500|                    123|
|12000| 800|                    136|
+-----+----+-----------------------+

Here, we have joined the two data frames with an inner join on the column 'Department_Number' of the first data frame and the column 'Match_Department_Number' of the second data frame. Since both data frames contain a column named 'Fine', and withColumnRenamed() refers to a column by its name rather than its index, we renamed the 'Fine' column of the second data frame by passing the old column name, 'Fine', and the new column name, 'Updated Fine'.

Python3




# Rename duplicated columns after join in Pyspark
# data frame using the withColumnRenamed() function

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create the first data frame using createDataFrame function
data_frame_1 = spark_session.createDataFrame(
    [(1, 'Arun', 100, 112), (2, 'Ishita', 200, 123),
     (3, 'Vinayak', 400, 112)],
    ['Roll_Number', 'Name', 'Fine', 'Department_Number'])

# Create the second data frame using createDataFrame function
data_frame_2 = spark_session.createDataFrame(
    [(10000, 400, 112), (14000, 500, 123),
     (12000, 800, 136)],
    ['Fees', 'Fine', 'Match_Department_Number'])

# Rename the duplicated 'Fine' column of the second data
# frame and join the data frames on the department columns
data_frame_1.join(data_frame_2.withColumnRenamed(
                      "Fine", "Updated Fine"),
                  data_frame_1.Department_Number == data_frame_2.Match_Department_Number,
                  "inner").show()


Output:

The inner join returns the columns Roll_Number, Name, Fine, Department_Number, Fees, Updated Fine and Match_Department_Number: department 112 matches Arun and Vinayak, department 123 matches Ishita, and department 136 of the second data frame has no match and is dropped. The second data frame's 'Fine' column now appears as 'Updated Fine'.
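
Once the duplicate has been renamed, both fine columns can be used like any other column. The short sketch below continues Example 1 (the select and filter operations are only illustrations) and references the two columns without any ambiguity error.

Python3

# Continuing Example 1: the renamed column removes the ambiguity
from pyspark.sql.functions import col

joined_df = data_frame_1.join(data_frame_2.withColumnRenamed(
                                  "Fine", "Updated Fine"),
                              data_frame_1.Department_Number == data_frame_2.Match_Department_Number,
                              "inner")

# Both 'Fine' and 'Updated Fine' can now be referenced by name
joined_df.select("Name", "Fine", "Updated Fine").show()

# Example operation: keep rows where the updated fine is larger than the old one
joined_df.filter(col("Updated Fine") > col("Fine")).show()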

Example 2:

In this example, we have created two data frames: the first with the fields 'Roll_Number', 'Class' and 'Subject', and the second with the fields 'Next_Class' and 'Subject'.

First Data Frame:

+-----------+-----+-------+
|Roll_Number|Class|Subject|
+-----------+-----+-------+
|          1|    5|  Maths|
|          2|    6|English|
|          3|    8|Science|
+-----------+-----+-------+

Second Data Frame:

+----------+--------------+
|Next_Class|       Subject|
+----------+--------------+
|         6|       English|
|         7|Social Science|
|         9|      Computer|
+----------+--------------+

Here, we have joined the two data frames with an outer join, matching the 'Class' column of the first data frame plus one with the 'Next_Class' column of the second data frame. Since both data frames contain a column named 'Subject', and withColumnRenamed() refers to a column by its name rather than its index, we renamed the 'Subject' column of the first data frame by passing the old column name, 'Subject', and the new column name, 'Previous Year Subject'.

Python3




# Rename duplicated columns after join in Pyspark
# data frame using the withColumnRenamed() function

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create the first data frame using createDataFrame function
data_frame_1 = spark_session.createDataFrame(
    [(1, 5, 'Maths'), (2, 6, 'English'), (3, 8, 'Science')],
    ['Roll_Number', 'Class', 'Subject'])

# Create the second data frame using createDataFrame function
data_frame_2 = spark_session.createDataFrame(
    [(6, 'English'), (7, 'Social Science'), (9, 'Computer')],
    ['Next_Class', 'Subject'])

# Rename the duplicated 'Subject' column of the first data frame
# and outer join the data frames on Class + 1 == Next_Class
data_frame_1.withColumnRenamed(
    "Subject", "Previous Year Subject").join(
        data_frame_2,
        data_frame_1.Class + 1 == data_frame_2.Next_Class,
        "outer").show()


Output:

Every 'Class' value of the first data frame plus one matches a 'Next_Class' value of the second data frame, so the outer join pairs Maths with English, English with Social Science, and Science with Computer. The first data frame's subject appears as 'Previous Year Subject' and the second data frame's as 'Subject'.
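
If more than one column name collides after the join, withColumnRenamed() can simply be chained once per duplicate on one of the data frames before joining. The sketch below uses hypothetical data frames and column names to show the pattern.

Python3

# Sketch: renaming several duplicated columns by chaining withColumnRenamed()
# (the data frames and column names here are hypothetical)
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

left_df = spark_session.createDataFrame(
    [(1, 'Maths', 100)], ['Roll_Number', 'Subject', 'Fine'])
right_df = spark_session.createDataFrame(
    [(1, 'English', 400)], ['Roll_Number', 'Subject', 'Fine'])

# Rename every clashing column of the second data frame before the join
right_renamed = (right_df
                 .withColumnRenamed('Roll_Number', 'Right_Roll_Number')
                 .withColumnRenamed('Subject', 'Next_Subject')
                 .withColumnRenamed('Fine', 'Updated_Fine'))

left_df.join(right_renamed,
             left_df.Roll_Number == right_renamed.Right_Roll_Number,
             'inner').show()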


