
How to join on multiple columns in Pyspark?

Last Updated : 19 Dec, 2021

In this article, we will discuss how to join two PySpark DataFrames on multiple columns using Python.

Let’s create the first dataframe:

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]

# specify column names
columns = ['ID1', 'NAME1']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)
  
dataframe.show()


Output:
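+---+------+
|ID1| NAME1|
+---+------+
|  1|sravan|
|  2|ojsawi|
|  3| bobby|
+---+------+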

Let’s create the second dataframe:

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
        (4, "rohith"), (5, "gnanesh")]

# specify column names
columns = ['ID2', 'NAME2']

# creating a dataframe from the list of data
dataframe1 = spark.createDataFrame(data, columns)
  
dataframe1.show()


Output:
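+---+-------+
|ID2|  NAME2|
+---+-------+
|  1| sravan|
|  2| ojsawi|
|  3|  bobby|
|  4| rohith|
|  5|gnanesh|
+---+-------+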

We can join on multiple columns by passing a join condition to the join() function, combining the individual column comparisons with a conditional operator.

Syntax: dataframe.join(dataframe1, (dataframe.column1 == dataframe1.column1) & (dataframe.column2 == dataframe1.column2))

where, 

  • dataframe is the first dataframe
  • dataframe1 is the second dataframe
  • column1 is the first matching column in both dataframes
  • column2 is the second matching column in both dataframes

Example 1: PySpark code to join the two dataframes on multiple columns (ID and name)

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of first employee dataset
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]

# specify column names
columns = ['ID1', 'NAME1']

# creating the first dataframe
dataframe = spark.createDataFrame(data, columns)

# list of second employee dataset
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
        (4, "rohith"), (5, "gnanesh")]

# specify column names
columns = ['ID2', 'NAME2']

# creating the second dataframe
dataframe1 = spark.createDataFrame(data, columns)

# join rows where both the ID and the name match
dataframe.join(dataframe1, (dataframe.ID1 == dataframe1.ID2)
               & (dataframe.NAME1 == dataframe1.NAME2)).show()


Output:
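+---+------+---+------+
|ID1| NAME1|ID2| NAME2|
+---+------+---+------+
|  1|sravan|  1|sravan|
|  2|ojsawi|  2|ojsawi|
|  3| bobby|  3| bobby|
+---+------+---+------+

Only the rows whose ID and name both match in the two dataframes appear in the result (row order may vary).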

Example 2: Join with the | (or) operator

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of first employee dataset
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]

# specify column names
columns = ['ID1', 'NAME1']

# creating the first dataframe
dataframe = spark.createDataFrame(data, columns)

# list of second employee dataset
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
        (4, "rohith"), (5, "gnanesh")]

# specify column names
columns = ['ID2', 'NAME2']

# creating the second dataframe
dataframe1 = spark.createDataFrame(data, columns)

# join rows where either the ID or the name matches
dataframe.join(dataframe1, (dataframe.ID1 == dataframe1.ID2)
               | (dataframe.NAME1 == dataframe1.NAME2)).show()


Output:
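+---+------+---+------+
|ID1| NAME1|ID2| NAME2|
+---+------+---+------+
|  1|sravan|  1|sravan|
|  2|ojsawi|  2|ojsawi|
|  3| bobby|  3| bobby|
+---+------+---+------+

With this sample data every ID match is also a name match, so the | join happens to return the same rows as Example 1; with data where only one of the two columns matches, those rows would also be kept (row order may vary).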


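When the matching columns share the same names in both dataframes, join() also accepts a list of column names instead of a column expression. The sketch below illustrates this variant with two new dataframes (df_a and df_b, created here just for illustration); passing a list keeps a single copy of each join column, avoiding the duplicated ID/name columns seen in the outputs above.

Python3

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# two dataframes that share the column names 'ID' and 'NAME'
df_a = spark.createDataFrame([(1, "sravan"), (2, "ojsawi")], ['ID', 'NAME'])
df_b = spark.createDataFrame([(1, "sravan"), (3, "bobby")], ['ID', 'NAME'])

# join on both columns by name; ID and NAME appear only once in the result
df_a.join(df_b, on=['ID', 'NAME'], how='inner').show()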
