Python PySpark – Union and UnionAll
Last Updated: 21 Feb, 2022
In this article, we will discuss Union and UnionAll in PySpark in Python.
Union in PySpark
The PySpark union() function combines two or more DataFrames into one by appending their rows. It matches columns by position, not by name, so both DataFrames must have the same structure or schema. If the number of columns differs, Spark raises an error; if only the column order or types differ, the call may succeed but produce an incorrect result (see Example 2).
Syntax:
dataFrame1.union(dataFrame2)
Here,
- dataFrame1 and dataFrame2 are the DataFrames to be combined
Example 1:
In this example, we combine two DataFrames, data_frame1 and data_frame2, which have the same schema.
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()

data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"])

data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"])

# Append the rows of data_frame2 to data_frame1,
# matching columns by position
answer = data_frame1.union(data_frame2)
answer.show()
Output:
Example 2:
In this example, we again combine data_frame1 and data_frame2, but this time their schemas differ: the two columns appear in opposite order. Because union() matches columns by position rather than by name, the call succeeds but produces a scrambled result, with names and percentages landing in the wrong columns. This is why union() should only be used on DataFrames with identical schemas.
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()

data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"])

# Same column names, but in the opposite order
data_frame2 = spark.createDataFrame(
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
    ["Overall Percentage", "Student Name"])

# union() matches columns by position, not by name,
# so the values end up in the wrong columns
answer = data_frame1.union(data_frame2)
answer.show()
Output:
unionAll() in PySpark
The unionAll() function performs the same task as union(). It has been deprecated since Spark 2.0.0 and is kept only as an alias for union(), so union() is the recommended function.
Syntax:
dataFrame1.unionAll(dataFrame2)
Here,
- dataFrame1 and dataFrame2 are the DataFrames to be combined
Example 1:
In this example, we combine two DataFrames, data_frame1 and data_frame2, which have the same schema.
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()

data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"])

data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"])

# unionAll() is an alias for union(): rows of data_frame2
# are appended to data_frame1, matching columns by position
answer = data_frame1.unionAll(data_frame2)
answer.show()
Output:
Example 2:
In this example, we again combine data_frame1 and data_frame2 with differing schemas: the columns appear in opposite order. Like union(), unionAll() matches columns by position, so the call succeeds but the values land in the wrong columns. unionAll() should therefore also only be used on DataFrames with identical schemas.
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()

data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"])

# Same column names, but in the opposite order
data_frame2 = spark.createDataFrame(
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
    ["Overall Percentage", "Student Name"])

# unionAll() also matches columns by position, not by name,
# so the values end up in the wrong columns
answer = data_frame1.unionAll(data_frame2)
answer.show()
Output: