Outer join Spark dataframe with non-identical join column


In PySpark, data frames are one of the most important data structures used for data processing and manipulation. The outer join operation in PySpark data frames is an important operation to combine data from multiple sources. However, sometimes the join column in the two DataFrames may not be identical, which may result in missing values.

In this article, we will discuss how to perform an outer join on two PySpark DataFrames with non-identical join columns and then merge the join columns.

Syntax of join() function

Syntax: DataFrame.join(other, on=None, how=None)

Parameters: 

  • other: DataFrame. Right side of the join
  • on : str, list or Column, optional. A column name, a list of column names, or a join expression (Column).
  • how : str, optional. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.
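
As a quick sketch of this signature, the on argument can be a single column name or a Column expression, and how selects the join type. The DataFrames emp and dept below are throwaway examples for illustration only, not from the article:

Python3

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JoinSyntax").getOrCreate()

emp = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["Name", "DeptId"])
dept = spark.createDataFrame([(1, "HR"), (3, "IT")], ["DeptId", "Dept"])

# on as a column name string: the join column appears once in the result
emp.join(dept, "DeptId", how="outer").show()

# on as a join expression (Column): both DeptId columns are kept
emp.join(dept, emp.DeptId == dept.DeptId, how="left").show()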

Dataframes Used for Outer Join and Merge Join Columns in PySpark

To illustrate the concept of outer join and merging join columns in PySpark data frames, we will create two sample data frames. In the first join example, the join column in both data frames is called "Name," but the values in the two columns are not identical (in the later merge example, the second data frame's join column is instead named "people").

[Dataframe 1 screenshot]

[Dataframe 2 screenshot]

Outer Join using the Join function

To perform an outer join on the two DataFrames, we will use the join() function in PySpark. It is called on one DataFrame and takes the other DataFrame, the join condition (or column name), and the join type as arguments. The outer join returns all rows from both DataFrames: matching rows are combined, and for non-matching rows the columns coming from the other DataFrame contain null values.

Python3




from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("OuterJoin").getOrCreate()
  
# Create Dataframe 1
  
df1 = spark.createDataFrame([
    ("Alice", 22, "Female"),
    ("Bob", 35, "Male"),
    ("Jack", 28, "Male"),
    ("Jill", 30, "Female")
], ["Name", "Age", "Gender"])
  
df1.show()
# Create Dataframe 2
  
df2 = spark.createDataFrame([
    ("Alice", "Chicago", "IL"),
    ("Bob", "Boston", "MA"),
    ("Charlie", "Houston", "TX"),
    ("David", "Austin", "TX")
], ["Name", "City", "State"])
df2.show()
  
# Outer Join operation
# df3 = df1.join(df2, "Name", how='outer')
df3 = df1.join(df2, df1.Name == df2.Name, how='outer')
  
df3.show()


Output:

Here, we can see that df1 and df2 are outer joined; where a row has no match, null is filled in. All the columns from both DataFrames, including both Name columns, are present in the final DataFrame.

[Output screenshot: joined DataFrame with both Name columns]
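Because the expression-based join keeps both Name columns, you may want to merge them into one yourself. A minimal sketch using coalesce(), assuming df1, df2 and df3 from the example above (df4 is just an illustrative name):

Python3

from pyspark.sql.functions import coalesce

# Merge the two Name columns: take df1.Name when present,
# otherwise fall back to df2.Name
df4 = df3.select(
    coalesce(df1.Name, df2.Name).alias("Name"),
    "Age", "Gender", "City", "State"
)
df4.show()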

A simpler option: passing the join column name as a string keeps only one Name column automatically, which improves readability. If you want both columns, use the expression-based method above.

Python3




# Outer Join operation
df3 = df1.join(df2, "Name", how='outer')
  
df3.show()


Output:

[Output screenshot: joined DataFrame with a single Name column]

Outer Join Using Merge

The merge method is not available in PySpark. However, it is available in Pandas. If you are working with small data sets and you want to use the merge method, you can convert your PySpark data frames to Pandas data frames, merge them using the merge method, and then convert the resulting Pandas data frame back to a PySpark data frame. 

Here, we first create two PySpark data frames, df1 and df2. We then convert them to Pandas data frames using the toPandas() method. Next, we use Pandas' merge method to outer-merge pdf1 and pdf2, joining the "Name" column of pdf1 to the "people" column of pdf2 via the left_on and right_on arguments, with how="outer". The resulting Pandas data frame pdf3 can then be converted back to a PySpark data frame with the createDataFrame() method, as sketched after the output below.

Python3




from pyspark.sql import SparkSession
import pandas as pd
  
# Create a SparkSession
spark = SparkSession.builder.appName("OuterJoin").getOrCreate()
  
# Create Dataframe 1
df1 = spark.createDataFrame([
    ("Alice", 22, "Female"),
    ("Bob", 35, "Male"),
    ("Jack", 28, "Male"),
    ("Jill", 30, "Female")
], ["Name", "Age", "Gender"])
  
df1.show()
  
# Create Dataframe 2
df2 = spark.createDataFrame([
    ("Alice", "Chicago", "IL"),
    ("Bob", "Boston", "MA"),
    ("Charlie", "Houston", "TX"),
    ("David", "Austin", "TX")
], ["people", "City", "State"])
df2.show()
  
# Convert PySpark data frames to Pandas data frames
pdf1 = df1.toPandas()
pdf2 = df2.toPandas()
  
# Merge Pandas data frames using the merge method
pdf3 = pd.merge(pdf1, pdf2, how="outer", left_on="Name", right_on="people")
  
print(pdf3)


Output:

[Output screenshot: merged Pandas data frame]
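If you need the merged result back as a PySpark data frame, pdf3 can be passed to createDataFrame(). A minimal sketch; the NaN-to-None replacement is a precaution assumed here, because pandas marks missing values as NaN (a float), which Spark's type inference cannot mix with strings:

Python3

# Replace pandas NaN markers with None so Spark can infer
# column types cleanly (NaN is a float and clashes with strings)
pdf3 = pdf3.astype(object).where(pd.notnull(pdf3), None)

# Convert the merged Pandas data frame back to PySpark
df3 = spark.createDataFrame(pdf3)
df3.show()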



Last Updated : 16 Mar, 2023