
Optimize Conversion between PySpark and Pandas DataFrames

Last Updated : 01 Dec, 2022

PySpark and Pandas are two open-source libraries that are used for doing data analysis and handling data in Python. Pandas operates on tabular data in memory on a single machine, while PySpark is the Python API for Apache Spark, a distributed engine for processing large datasets across a cluster.

Conversion between PySpark and Pandas DataFrames

In this article, we are going to look at how to convert a PySpark DataFrame into a Pandas DataFrame and vice versa. Both conversions can be done easily in PySpark.

Converting Pandas DataFrame into a PySpark DataFrame

Here, we'll convert a Pandas DataFrame into a PySpark DataFrame. First of all, we import the PySpark and Pandas libraries and start a Spark session. Then we create a Pandas DataFrame and pass it as a parameter to the session's createDataFrame() method, storing the result in the same variable that held the Pandas DataFrame. These steps convert the Pandas DataFrame into a PySpark DataFrame.

Example:

Python3




# importing pandas and the PySpark SparkSession
import pandas as pd
from pyspark.sql import SparkSession

# initializing the PySpark session
spark = SparkSession.builder.getOrCreate()

# creating a pandas DataFrame
df = pd.DataFrame({
    'Cardinal': [1, 2, 3],
    'Ordinal': ['First', 'Second', 'Third']
})

# converting the pandas DataFrame into a PySpark DataFrame
df = spark.createDataFrame(df)

# printing the first two rows
df.show(2)


Output:

+--------+-------+
|Cardinal|Ordinal|
+--------+-------+
|       1|  First|
|       2| Second|
+--------+-------+
only showing top 2 rows

If you would like to keep using the pandas DataFrame later, store the converted PySpark DataFrame in a different variable instead of overwriting it, as shown below.
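
A minimal sketch of that variant (the variable names pandas_df and spark_df are illustrative, not required by the API):

Python3

# importing the libraries and starting a session
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# keep the pandas DataFrame intact by storing the
# converted result in a separate variable
pandas_df = pd.DataFrame({
    'Cardinal': [1, 2, 3],
    'Ordinal': ['First', 'Second', 'Third']
})
spark_df = spark.createDataFrame(pandas_df)

# both objects remain usable afterwards
print(pandas_df.shape)   # pandas DataFrame: (3, 2)
spark_df.show(2)         # PySpark DataFrame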

Converting PySpark DataFrame into a Pandas DataFrame

Now, we will convert a PySpark DataFrame into a Pandas DataFrame. The steps are the same as before, but this time we call the toPandas() method on the PySpark DataFrame.

Syntax of the toPandas() method:

spark_DataFrame.toPandas()

Example:

Python3




# importing SparkSession and Row from the PySpark SQL module
from pyspark.sql import Row, SparkSession

# initializing the PySpark session
spark = SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame from a list of Rows
spark_df = spark.createDataFrame([
    Row(Cardinal=1, Ordinal='First'),
    Row(Cardinal=2, Ordinal='Second'),
    Row(Cardinal=3, Ordinal='Third')
])

# converting spark_df into a pandas DataFrame
pandas_df = spark_df.toPandas()

pandas_df.head()


Output:

   Cardinal Ordinal
0         1   First
1         2  Second
2         3   Third
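
Keep in mind that toPandas() collects the entire distributed DataFrame onto the driver, so it is only safe for data that fits in the driver's memory. A minimal sketch of one common safeguard, capping the row count with limit() before converting (the value 1000 is an arbitrary illustration; spark_df is reused from the example above):

Python3

# toPandas() pulls every row to the driver, so cap the
# size first if the PySpark DataFrame may be large
small_pandas_df = spark_df.limit(1000).toPandas()
print(small_pandas_df.shape)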

Now we will check the time required to do the above conversion.

Python3




%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# creating session in PySpark
spark = SparkSession.builder.getOrCreate()

# creating a 10x10 PySpark DataFrame of random integers
spark_df = spark.createDataFrame(pd.DataFrame(
    np.random.randint(1, 101, size=100).reshape(10, 10)))

# converting it back to pandas
spark_df.toPandas()


Output:

3.17 s

Now let's enable PyArrow and see how long the same conversion takes.

Python3




%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# creating session in PySpark
spark = SparkSession.builder.getOrCreate()

# creating a 10x10 PySpark DataFrame of random integers
spark_df = spark.createDataFrame(pd.DataFrame(
    np.random.randint(1, 101, size=100).reshape(10, 10)))

# enabling Arrow-based columnar data transfers (this key is
# deprecated on Spark 3.x in favour of
# 'spark.sql.execution.arrow.pyspark.enabled')
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
spark_df.toPandas()


Output:

460 ms

Here we can see that the time required to convert between PySpark and Pandas DataFrames is reduced drastically once Arrow is enabled, because the data is transferred in Arrow's columnar format instead of being serialized row by row.
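
If you want Arrow enabled from the start, the configuration can also be set while building the session. A minimal sketch, assuming Spark 3.x, where the key is spark.sql.execution.arrow.pyspark.enabled (the older spark.sql.execution.arrow.enabled still works but is deprecated); the optional fallback setting makes Spark revert to the non-Arrow path for data types Arrow cannot handle:

Python3

from pyspark.sql import SparkSession

# build a session with Arrow-based conversion enabled up front
spark = SparkSession.builder \
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true') \
    .config('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true') \
    .getOrCreate()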


