How to Convert Pandas to PySpark DataFrame ?

  • Last Updated : 23 May, 2021

In this article, we will learn how to convert a Pandas DataFrame to a PySpark DataFrame. Sometimes we receive data in CSV, XLSX, or similar formats and need to store it in a PySpark DataFrame; one way to do this is to load the data into Pandas first and then convert it. For the conversion, we pass the Pandas DataFrame to the createDataFrame() method.

Syntax: spark.createDataFrame(data, schema)


Parameters:

  • data – the data (for example, a Pandas DataFrame or a list of values) from which the DataFrame is created.
  • schema – the structure of the dataset, or a list of column names.

where spark is the SparkSession object.

Example 1: Create a DataFrame and then Convert using spark.createDataFrame() method

Python3
# import the pandas
import pandas as pd
  
# from  pyspark library import 
# SparkSession
from pyspark.sql import SparkSession
  
# Building the SparkSession and name
# it :'pandas to spark'
spark = SparkSession.builder.appName(
  "pandas to spark").getOrCreate()
  
# Create the DataFrame with the help 
# of pd.DataFrame()
data = pd.DataFrame({'State': ['Alaska', 'California',
                               'Florida', 'Washington'],
                       
                     'city': ["Anchorage", "Los Angeles",
                              "Miami", "Bellevue"]})
  
# create DataFrame
df_spark = spark.createDataFrame(data)
  
df_spark.show()

Output:

Example 2: Convert using spark.createDataFrame() with Apache Arrow enabled

In this method, we enable Apache Arrow to speed up the conversion from Pandas to a PySpark DataFrame.

Python3
# import the pandas library
import pandas as pd
  
# from  pyspark library import 
# SparkSession
from pyspark.sql import SparkSession
  
# Building the SparkSession and name 
# it :'pandas to spark'
spark = SparkSession.builder.appName(
  "pandas to spark").getOrCreate()
  
# Create the DataFrame with the help 
# of pd.DataFrame()
data = pd.DataFrame({'State': ['Alaska', 'California',
                               'Florida', 'Washington'],
                       
                     'city': ["Anchorage", "Los Angeles",
                              "Miami", "Bellevue"]})
  
  
# Enabling Apache Arrow for converting
# Pandas to a PySpark DataFrame (on Spark 3.0+,
# the option is spark.sql.execution.arrow.pyspark.enabled)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
  
# Creating the DataFrame
spark_arrow = spark.createDataFrame(data)
  
# Show the DataFrame
spark_arrow.show()

Output:

Example 3: Load a DataFrame from CSV and then Convert

In this method, we read a CSV file into a Pandas DataFrame and also into a PySpark DataFrame. The dataset used here is heart.csv.

Python3
# import the pandas library
import pandas as pd
  
# Read the Dataset in Pandas Dataframe
df_pd = pd.read_csv('heart.csv')
  
# Show the dataset; head()
# returns the top 5 rows
df_pd.head()

Output:

Python3
# Reading the csv file in 
# Pyspark DataFrame
df_spark2 = spark.read.option(
  'header', 'true').csv("heart.csv")
  
# Show the data in table form;
# display only the top 5 rows
df_spark2.show(5)

Output:

We can also convert a PySpark DataFrame back to a Pandas DataFrame. For this, we use the DataFrame.toPandas() method.

Syntax: DataFrame.toPandas()

Returns the contents of this DataFrame as a pandas.DataFrame.

Python3
# Convert the PySpark DataFrame to a
# Pandas DataFrame with toPandas();
# head() shows only the top 5
# rows of the dataset
df_spark2.toPandas().head()

Output:



