Removing duplicate columns after DataFrame join in PySpark
Last Updated: 29 Dec, 2021
In this article, we will discuss how to remove duplicate columns after a DataFrame join in PySpark.
Create the first dataframe for demonstration:
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    'pyspark - example join').getOrCreate()

data = [('Ram', 1, 'M'),
        ('Mike', 2, 'M'),
        ('Rohini', 3, 'M'),
        ('Maria', 4, 'F'),
        ('Jenis', 5, 'F')]
columns = ["Name", "ID", "Gender"]

df1 = spark.createDataFrame(data=data, schema=columns)
df1.show()
Output:
+------+---+------+
| Name| ID|Gender|
+------+---+------+
| Ram| 1| M|
| Mike| 2| M|
|Rohini| 3| M|
| Maria| 4| F|
| Jenis| 5| F|
+------+---+------+
Create a second dataframe for demonstration:
Python3
data2 = [(1, 3000),
         (2, 4000),
         (3, 4000),
         (4, 4000),
         (5, 1200)]
columns = ["ID", "salary"]

df2 = spark.createDataFrame(data=data2, schema=columns)
df2.show()
Output:
+---+------+
| ID|salary|
+---+------+
| 1| 3000|
| 2| 4000|
| 3| 4000|
| 4| 4000|
| 5| 1200|
+---+------+
Using join()
This will join the two dataframes on the given condition.
Syntax: dataframe.join(dataframe1, condition).show()
where,
dataframe is the first dataframe
dataframe1 is the second dataframe
condition is the join condition on the common column
Let’s see the dataframe after join:
Python3
df = df1.join(df2, df1.ID == df2.ID)
df.show()
Output:
+------+---+------+---+------+
| Name| ID|Gender| ID|salary|
+------+---+------+---+------+
| Jenis| 5| F| 5| 1200|
| Ram| 1| M| 1| 3000|
|Rohini| 3| M| 3| 4000|
| Mike| 2| M| 2| 4000|
| Maria| 4| F| 4| 4000|
+------+---+------+---+------+
Here we see that the ID and salary columns are added to our existing dataframe.
Now, let us check the columns once.
Selecting the Gender column works fine because its name is unique:
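The duplicate column is easy to spot by listing the column names of the joined dataframe. A quick check, assuming the same `df1` and `df2` built above (shortened to two rows here):

Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    'pyspark - example join').getOrCreate()

df1 = spark.createDataFrame([('Ram', 1, 'M'), ('Mike', 2, 'M')],
                            ["Name", "ID", "Gender"])
df2 = spark.createDataFrame([(1, 3000), (2, 4000)],
                            ["ID", "salary"])

# A condition-style join keeps both ID columns, one from each side.
df = df1.join(df2, df1.ID == df2.ID)
print(df.columns)  # ['Name', 'ID', 'Gender', 'ID', 'salary']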
Python3
df.select('Gender').show()
Output:
+------+
|Gender|
+------+
| F|
| M|
| M|
| M|
| F|
+------+
Now let us check one of the duplicate columns.
Selecting ID produces an error, because the joined dataframe contains two columns with that name:
Python3
df.select('ID').show()
Output:
AnalysisException: Reference 'ID' is ambiguous, could be: ID, ID.
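With a condition-style join you can still select one of the duplicate columns unambiguously by qualifying it with the dataframe it came from. A minimal sketch, reusing the `df1` and `df2` dataframes created above (shortened to two rows here):

Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    'pyspark - example join').getOrCreate()

df1 = spark.createDataFrame([('Ram', 1, 'M'), ('Mike', 2, 'M')],
                            ["Name", "ID", "Gender"])
df2 = spark.createDataFrame([(1, 3000), (2, 4000)],
                            ["ID", "salary"])

df = df1.join(df2, df1.ID == df2.ID)

# df.select('ID') would raise AnalysisException, but qualifying the
# reference tells Spark which of the two ID columns is meant.
df.select(df1.ID).show()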
Removing duplicate columns after join in PySpark
If we want to drop the duplicate column, then we have to specify the duplicate column in the join function. Here we are simply using join to join two dataframes and then drop duplicate columns.
Syntax: dataframe.join(dataframe1, [‘column_name’]).show()
where,
- dataframe is the first dataframe
- dataframe1 is the second dataframe
- column_name is the common column that exists in both dataframes
Python3
new_df = df1.join(df2, ["ID"])
new_df.show()
Output:
+---+------+------+------+
| ID| Name|Gender|salary|
+---+------+------+------+
| 5| Jenis| F| 1200|
| 1| Ram| M| 3000|
| 3|Rohini| M| 4000|
| 2| Mike| M| 4000|
| 4| Maria| F| 4000|
+---+------+------+------+
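Another common approach, not covered above, is to keep the condition-style join and drop one of the duplicate columns afterwards with drop(). A sketch under the same setup as before (shortened to two rows):

Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    'pyspark - example join').getOrCreate()

df1 = spark.createDataFrame([('Ram', 1, 'M'), ('Mike', 2, 'M')],
                            ["Name", "ID", "Gender"])
df2 = spark.createDataFrame([(1, 3000), (2, 4000)],
                            ["ID", "salary"])

# Join on an explicit condition, then drop the right-hand ID column;
# drop() accepts a qualified Column, so there is no ambiguity.
new_df = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
new_df.show()

This is handy when the join keys have different names in the two dataframes, where the list-of-columns form of join() cannot be used.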