
Concatenate two PySpark dataframes

Last Updated : 04 Jan, 2022

In this article, we are going to see how to concatenate two PySpark DataFrames using Python.

Creating Dataframe for demonstration:

Python3




# Importing necessary libraries
from pyspark.sql import SparkSession
  
# Create a spark session
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
  
# Create data in dataframe
data = [('Ram', '1991-04-01', 'M', 3000),
        ('Mike', '2000-05-19', 'M', 4000),
        ('Rohini', '1978-09-05', 'M', 4000),
        ('Maria', '1967-12-01', 'F', 4000),
        ('Jenis', '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df1 = spark.createDataFrame(data=data, schema=columns)
  
# Print the dataframe
df1.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Creating a second dataframe for demonstration:

Python3




# Create data in dataframe
data2 = [('Mohi', '1991-04-01', 'M', 3000),
         ('Ani', '2000-05-19', 'F', 4300),
         ('Shipta', '1978-09-05', 'F', 4200),
         ('Jessy', '1967-12-01', 'F', 4010),
         ('kanne', '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df2 = spark.createDataFrame(data=data2, schema=columns)
  
# Print the dataframe
df2.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+

Method 1: Using union()

The union() method of a DataFrame is used to combine two DataFrames that have the same structure/schema.

Syntax: dataframe_1.union(dataframe_2)

where,

  1. dataframe_1 is the first dataframe
  2. dataframe_2 is the second dataframe

Example:

Python3




# union the two DataFrames created above
result = df1.union(df2)
  
# display
result.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
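
Note that union() matches columns by position and keeps every row, including duplicates. If duplicates should be removed after concatenation, distinct() (or dropDuplicates()) can be chained on the result; a minimal sketch reusing the df1 and df2 created above:

Python3

# union() keeps every row from both DataFrames;
# distinct() drops exact duplicate rows after the concatenation
deduped = df1.union(df2).distinct()

# display the de-duplicated result
deduped.show()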

Method 2: Using unionByName()

The unionByName() method concatenates two DataFrames by matching column names rather than column positions, so it also works when the columns appear in a different order.

Syntax: dataframe_1.unionByName(dataframe_2)

where,

  1. dataframe_1 is the first dataframe
  2. dataframe_2 is the second dataframe

Example:

Python3




# union the two DataFrames by using unionByName()
result1 = df1.unionByName(df2)
  
# display
result1.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
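
Starting with Spark 3.1, unionByName() also accepts an allowMissingColumns flag that fills columns missing on either side with nulls. A small sketch, using a hypothetical df3 that lacks the salary column:

Python3

# df3 has only a subset of df1's columns (hypothetical example row)
df3 = spark.createDataFrame([('Arun', '1995-07-23', 'M')],
                            ["Name", "DOB", "Gender"])

# Spark 3.1+: columns missing from one side are filled with null
result2 = df1.unionByName(df3, allowMissingColumns=True)

# display
result2.show()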

Method 3: Using functools

The functools module provides functions for working with other functions and callable objects, letting you use or extend them without completely rewriting them.

Syntax:

functools.reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)

where,

  • df1 is the first dataframe
  • df2 is the second dataframe

Here, we define a helper function unionAll() that uses functools.reduce() to fold union() over a list of DataFrames, pass our two DataFrames to it, and show the resulting concatenated DataFrame.

Example:

Python3




import functools


# Helper that concatenates a list of DataFrames by repeatedly applying
# union(); each DataFrame's columns are reordered to match the first one
def unionAll(dfs):
    return functools.reduce(lambda df1, df2: df1.union(
      df2.select(df1.columns)), dfs)
  
  
# unionAll
result3 = unionAll([df1, df2])
result3.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
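
Because unionAll() folds union() over a list, the same helper concatenates any number of DataFrames in a single call, as long as they share the same columns; a quick sketch that stacks df1, df2, and df1 again:

Python3

# the helper accepts an arbitrary list of DataFrames with matching columns
result4 = unionAll([df1, df2, df1])

# display the 15 concatenated rows
result4.show()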

