How to Convert Pandas to PySpark DataFrame ?
Last Updated :
22 Mar, 2023
In this article, we will learn How to Convert Pandas to PySpark DataFrame. Sometimes we will get csv, xlsx, etc. format data, and we have to store it in PySpark DataFrame and that can be done by loading data in Pandas then converted PySpark DataFrame. For conversion, we pass the Pandas dataframe into the CreateDataFrame() method.
Syntax: spark.createDataframe(data, schema)
Parameter:
- data – list of values on which dataframe is created.
- schema – It’s the structure of dataset or list of column names.
where spark is the SparkSession object.
Example 1: Create a DataFrame and then Convert using spark.createDataFrame() method
Python3
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName(
"pandas to spark" ).getOrCreate()
data = pd.DataFrame({ 'State' : [ 'Alaska' , 'California' ,
'Florida' , 'Washington' ],
'city' : [ "Anchorage" , "Los Angeles" ,
"Miami" , "Bellevue" ]})
df_spark = spark.createDataFrame(data)
df_spark.show()
|
Output:
Example 2: Create a DataFrame and then Convert using spark.createDataFrame() method
In this method, we are using Apache Arrow to convert Pandas to Pyspark DataFrame.
Python3
import the pandas
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName(
"pandas to spark" ).getOrCreate()
data = pd.DataFrame({ 'State' : [ 'Alaska' , 'California' ,
'Florida' , 'Washington' ],
'city' : [ "Anchorage" , "Los Angeles" ,
"Miami" , "Bellevue" ]})
spark.conf. set ( "spark.sql.execution.arrow.enabled" , "true" )
sprak_arrow = spark.createDataFrame(data)
sprak_arrow.show()
|
Output:
Example 3: Load a DataFrame from CSV and then Convert
In this method, we can easily read the CSV file in Pandas Dataframe as well as in Pyspark Dataframe. The dataset used here is heart.csv.
Python3
import pandas as pd
df_pd = pd.read_csv( 'heart.csv' )
df_pd.head()
|
Output:
Python3
df_spark2 = spark.read.option(
'header' , 'true' ).csv( "heart.csv" )
df_spark2.show( 5 )
|
Output:
We can also convert pyspark Dataframe to pandas Dataframe. For this, we will use DataFrame.toPandas() method.
Syntax: DataFrame.toPandas()
Returns the contents of this DataFrame as Pandas pandas.DataFrame.
Python3
df_spark2.toPandas().head()
|
Output:
Share your thoughts in the comments
Please Login to comment...