How to union multiple dataframes in PySpark?
Last Updated: 21 Feb, 2022
In this article, we will discuss how to union multiple data frames in PySpark.
Method 1: union() function in PySpark
The PySpark union() function combines two or more data frames having the same structure or schema, matching columns by position. It raises an error if the data frames have different numbers of columns; if the column counts match but the order or types differ, the union may still succeed and silently produce misaligned rows (see Example 2 below).
Syntax: data_frame1.union(data_frame2)
Where,
- data_frame1 and data_frame2 are the dataframes
Example 1:
Python3
from pyspark.sql import SparkSession

# build (or reuse) a Spark session
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()

data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)

data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"]
)

# union() appends the rows of data_frame2 below those of data_frame1
answer = data_frame1.union(data_frame2)
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| Naveen| 91.123|
| Piyush| 90.51|
+------------+------------------+
Example 2:
In this example, we combine two data frames, data_frame1 and data_frame2, whose schemas differ: the column order is reversed. Because union() matches columns by position rather than by name, the rows from data_frame2 end up with their values under the wrong headers, so the output is not the desired one.
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()

data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)

# note the reversed column order relative to data_frame1
data_frame2 = spark.createDataFrame(
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
    ["Overall Percentage", "Student Name"]
)

# union() matches columns by position, so the values get misaligned
answer = data_frame1.union(data_frame2)
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| 91.123| Naveen|
| 90.51| Piyush|
| 87.67| Hitesh|
+------------+------------------+
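Because union() runs happily whenever the column counts match, silent misalignment like the above can slip into a pipeline unnoticed. One defensive option is to compare the two frames' column names and types (via the dtypes property that Spark DataFrames expose) before unioning. The helper below is a minimal sketch; the name safe_union is our own, not part of the PySpark API:

```python
def safe_union(df1, df2):
    """Union two Spark DataFrames, raising if their column
    names and types do not line up positionally.

    Relies only on the `dtypes` list and `union()` method that
    pyspark.sql.DataFrame provides.
    """
    if df1.dtypes != df2.dtypes:
        raise ValueError(
            "schema mismatch: %s vs %s" % (df1.dtypes, df2.dtypes)
        )
    return df1.union(df2)
```

Applied to the two frames in Example 2, this raises a ValueError instead of quietly putting percentages under the Student Name header.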
Method 2: unionByName() function in PySpark
The PySpark unionByName() function also combines two or more data frames, but it can handle frames whose columns appear in different orders, because it matches columns by name rather than by position.
Syntax: data_frame1.unionByName(data_frame2)
Where,
- data_frame1 and data_frame2 are the dataframes
Example 1:
In this example, both data frames, data_frame1 and data_frame2 are of the same schema.
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()

data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)

data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"]
)

# unionByName() resolves columns by name before appending rows
answer = data_frame1.unionByName(data_frame2)
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| Naveen| 91.123|
| Piyush| 90.51|
+------------+------------------+
Example 2:
In this example, data_frame1 and data_frame2 have their columns in different orders, but unionByName() matches the columns by name, so the output is the desired one.
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()

data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)

# reversed column order relative to data_frame1
data_frame2 = spark.createDataFrame(
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
    ["Overall Percentage", "Student Name"]
)

# columns are matched by name, so the rows line up correctly
answer = data_frame1.unionByName(data_frame2)
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| Naveen| 91.123|
| Piyush| 90.51|
| Hitesh| 87.67|
+------------+------------------+
Example 3:
Let’s now consider two data frames that contain an unequal number of columns (entirely different schemas). In this case, we pass the additional argument allowMissingColumns=True to unionByName(); columns missing from one frame are filled with null in the result.
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()

data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98, "Computer Science"),
     ("Harshit", 80.31, "Information Technology")],
    ["Student Name", "Overall Percentage", "Department"]
)

# data_frame2 lacks the Department column
data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"]
)

# allowMissingColumns=True fills the absent column with null
res = data_frame1.unionByName(data_frame2, allowMissingColumns=True)
res.show()
Output:
+------------+------------------+--------------------+
|Student Name|Overall Percentage| Department|
+------------+------------------+--------------------+
| Bhuwanesh| 82.98| Computer Science|
| Harshit| 80.31|Information Techn...|
| Naveen| 91.123| null|
| Piyush| 90.51| null|
+------------+------------------+--------------------+
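Finally, since the title asks about unioning multiple data frames: both union() and unionByName() accept a single other frame at a time, so combining a whole list is usually done by chaining calls or by folding with functools.reduce. A minimal sketch under that approach (the helper name union_all is our own, not part of the PySpark API):

```python
from functools import reduce

def union_all(frames, by_name=True):
    """Fold a non-empty list of Spark DataFrames into one.

    by_name=True matches columns by name via unionByName();
    by_name=False falls back to positional union().
    """
    if not frames:
        raise ValueError("need at least one data frame")
    if by_name:
        return reduce(lambda left, right: left.unionByName(right), frames)
    return reduce(lambda left, right: left.union(right), frames)
```

For instance, union_all([data_frame1, data_frame2, data_frame3]) would stack three frames of the same schema in one call.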