Open In App

How to avoid duplicate columns after join in PySpark ?

Improve
Improve
Like Article
Like
Save
Share
Report

In this article, we will discuss how to avoid duplicate columns in DataFrame after join in PySpark using Python.

Create the first dataframe for demonstration:

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
  
# specify column names
columns = ['ID', 'NAME', 'Company']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
dataframe.show()


Output:

Create a second dataframe for demonstration:

Python3




# list  of employee data
data1 = [["1", "45000", "IT"],
         ["2", "145000", "Manager"],
         ["6", "45000", "HR"],
         ["5", "34000", "Sales"]]
  
# specify column names
columns = ['ID', 'salary', 'department']
  
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data1, columns)
  
dataframe1.show()


Output:

Method 1: Using drop() function

We can join the dataframes using joins like inner join and after this join, we can use the drop method to remove one duplicate column.

Syntax: dataframe.join(dataframe1,dataframe.column_name == dataframe1.column_name,”inner”).drop(dataframe.column_name)

where,

  • dataframe is the first dataframe
  • dataframe1 is the second dataframe
  • inner specifies inner join
  • drop() will delete the common column and delete first dataframe column

Example: Join two dataframes based on ID and remove duplicate ID in first dataframe

Python3




# inner join on two dataframes
# and remove duplicate column
dataframe.join(dataframe1,
               dataframe.ID == dataframe1.ID,
               "inner").drop(dataframe.ID).show()


Output:

Method 2: Using join()

Here we are simply using join to join two dataframes and then drop duplicate columns.

Syntax: dataframe.join(dataframe1, [‘column_name’]).show()

where,

  • dataframe is the first dataframe
  • dataframe1 is the second dataframe
  • column_name is the common column exists in two dataframes

Example: Join based on ID and remove duplicates

Python3




# join on two dataframes
# and remove duplicate column
dataframe.join(dataframe1, ['ID']).show()


Output:



Last Updated : 19 Dec, 2021
Like Article
Save Article
Previous
Next
Share your thoughts in the comments
Similar Reads