
Concatenate two PySpark dataframes

Last Updated : 04 Jan, 2022

In this article, we are going to see how to concatenate two PySpark DataFrames using Python.

Creating Dataframe for demonstration:

Python3




# Importing necessary libraries
from pyspark.sql import SparkSession
  
# Create a spark session
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
  
# Create data in dataframe
data = [('Ram', '1991-04-01', 'M', 3000),
        ('Mike', '2000-05-19', 'M', 4000),
        ('Rohini', '1978-09-05', 'M', 4000),
        ('Maria', '1967-12-01', 'F', 4000),
        ('Jenis', '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df1 = spark.createDataFrame(data=data, schema=columns)
  
# Print the dataframe
df1.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Creating a second dataframe for demonstration:

Python3




# Create data in dataframe
data2 = [('Mohi', '1991-04-01', 'M', 3000),
         ('Ani', '2000-05-19', 'F', 4300),
         ('Shipta', '1978-09-05', 'F', 4200),
         ('Jessy', '1967-12-01', 'F', 4010),
         ('kanne', '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df2 = spark.createDataFrame(data=data2, schema=columns)
  
# Print the dataframe
df2.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+

Method 1: Using union()

The union() method of a DataFrame is used to combine two DataFrames that have the same structure/schema.

Syntax: dataframe_1.union(dataframe_2)

where,

  1. dataframe_1 is the first dataframe
  2. dataframe_2 is the second dataframe

Example:

Python3




# union the two DataFrames created above
result = df1.union(df2)
  
# display
result.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
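
Note that union() matches columns by position and keeps every row, including duplicates. If duplicates should be removed after concatenation, distinct() (or dropDuplicates()) can be chained on the result; a minimal sketch reusing the df1 and df2 created above:

Python3

# union() keeps every row from both DataFrames;
# distinct() drops exact duplicate rows after the concatenation
deduped = df1.union(df2).distinct()

# display the de-duplicated result
deduped.show()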

Method 2: Using unionByName()

The unionByName() method concatenates two DataFrames by matching column names rather than column positions, so it also works when the columns appear in a different order.

Syntax: dataframe_1.unionByName(dataframe_2)

where,

  1. dataframe_1 is the first dataframe
  2. dataframe_2 is the second dataframe

Example:

Python3




# union the two DataFrames by using unionByName()
result1 = df1.unionByName(df2)
  
# display
result1.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
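
Starting with Spark 3.1, unionByName() also accepts an allowMissingColumns flag that fills columns missing on either side with nulls. A small sketch, using a hypothetical df3 that lacks the salary column:

Python3

# df3 has only a subset of df1's columns (hypothetical example row)
df3 = spark.createDataFrame([('Arun', '1995-07-23', 'M')],
                            ["Name", "DOB", "Gender"])

# Spark 3.1+: columns missing from one side are filled with null
result2 = df1.unionByName(df3, allowMissingColumns=True)

# display
result2.show()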

Method 3: Using functools

The functools module provides functions for working with other functions and callable objects, letting you use or extend them without completely rewriting them.

Syntax:

functools.reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)

where,

  • df1 is the first dataframe
  • df2 is the second dataframe

Here, we define a helper function unionAll() that uses functools.reduce() to fold union() over a list of DataFrames, pass our two DataFrames to it, and show the resulting concatenated DataFrame.

Example:

Python3




import functools


# Helper that concatenates a list of DataFrames by repeatedly applying
# union(); each DataFrame's columns are reordered to match the first one
def unionAll(dfs):
    return functools.reduce(lambda df1, df2: df1.union(
      df2.select(df1.columns)), dfs)
  
  
# unionAll
result3 = unionAll([df1, df2])
result3.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
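
Because unionAll() folds union() over a list, the same helper concatenates any number of DataFrames in a single call, as long as they share the same columns; a quick sketch that stacks df1, df2, and df1 again:

Python3

# the helper accepts an arbitrary list of DataFrames with matching columns
result4 = unionAll([df1, df2, df1])

# display the 15 concatenated rows
result4.show()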

