
How to Convert a Pandas DataFrame to a PySpark DataFrame?

Last Updated : 22 Mar, 2023

In this article, we will learn how to convert a Pandas DataFrame to a PySpark DataFrame. Data often arrives in CSV, XLSX, or similar formats, and one way to get it into a PySpark DataFrame is to load it into Pandas first and then convert it. For the conversion, we pass the Pandas DataFrame to the createDataFrame() method.

Syntax: spark.createDataFrame(data, schema)

Parameter:

  • data – the values (here, a Pandas DataFrame) from which the DataFrame is created.
  • schema – the structure of the dataset, or a list of column names.

where spark is the SparkSession object.

Example 1: Create a Pandas DataFrame and convert it using the spark.createDataFrame() method

Python3




# import the pandas
import pandas as pd
 
# from  pyspark library import
# SparkSession
from pyspark.sql import SparkSession
 
# Building the SparkSession and name
# it :'pandas to spark'
spark = SparkSession.builder.appName(
  "pandas to spark").getOrCreate()
 
# Create the DataFrame with the help
# of pd.DataFrame()
data = pd.DataFrame({'State': ['Alaska', 'California',
                               'Florida', 'Washington'],
                      
                     'city': ["Anchorage", "Los Angeles",
                              "Miami", "Bellevue"]})
 
# create DataFrame
df_spark = spark.createDataFrame(data)
 
df_spark.show()


Output:

Example 2: Convert using spark.createDataFrame() with Apache Arrow enabled

In this method, we use Apache Arrow to speed up the conversion from Pandas to a PySpark DataFrame.

Python3




# import the pandas
import pandas as pd
 
# from  pyspark library import
# SparkSession
from pyspark.sql import SparkSession
 
# Building the SparkSession and name
# it :'pandas to spark'
spark = SparkSession.builder.appName(
  "pandas to spark").getOrCreate()
 
# Create the DataFrame with the help
# of pd.DataFrame()
data = pd.DataFrame({'State': ['Alaska', 'California',
                               'Florida', 'Washington'],
                      
                     'city': ["Anchorage", "Los Angeles",
                              "Miami", "Bellevue"]})
 
 
# enabling Apache Arrow for converting
# Pandas to PySpark DF(DataFrame); on Spark 3.x
# the config key is
# "spark.sql.execution.arrow.pyspark.enabled"
# (the older "spark.sql.execution.arrow.enabled"
# is deprecated)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
 
# Creating the DataFrame
spark_arrow = spark.createDataFrame(data)
 
# Show the DataFrame
spark_arrow.show()


Output:

Example 3: Load a DataFrame from a CSV file and then convert

In this method, we read the CSV file into a Pandas DataFrame as well as directly into a PySpark DataFrame. The dataset used here is heart.csv.

Python3




# import the pandas library
import pandas as pd
 
# Read the Dataset in Pandas Dataframe
df_pd = pd.read_csv('heart.csv')
 
# Show the dataset; head() returns
# the top 5 rows
df_pd.head()


Output:

Python3




# Reading the csv file in
# Pyspark DataFrame
df_spark2 = spark.read.option(
  'header', 'true').csv("heart.csv")
 
# Showing the data in the form of
# table and showing only top 5 rows
df_spark2.show(5)


Output:

We can also convert a PySpark DataFrame back to a Pandas DataFrame. For this, we use the DataFrame.toPandas() method.

Syntax: DataFrame.toPandas()

Returns the contents of this DataFrame as a pandas.DataFrame.

Python3




# Convert the PySpark DataFrame to a
# Pandas DataFrame with toPandas();
# head() shows only the top 5 rows
# of the dataset
df_spark2.toPandas().head()


Output:


