
Optimize Conversion between PySpark and Pandas DataFrames

PySpark and Pandas are two open-source Python libraries used for data analysis and data handling.

Conversion between PySpark and Pandas DataFrames

In this article, we will look at how to convert a PySpark DataFrame into a Pandas DataFrame and vice versa. Both conversions are straightforward in PySpark.



Converting Pandas DataFrame into a PySpark DataFrame

Here, we’ll convert a Pandas DataFrame into a PySpark DataFrame. First, we import the PySpark and Pandas libraries and start a Spark session. Next, we create a Pandas DataFrame. To convert it, we call the createDataFrame() method, passing the Pandas DataFrame as its argument, and store the result in the same variable that held the Pandas DataFrame. These steps turn the Pandas DataFrame into a PySpark DataFrame.

Example:

# importing pandas and the PySpark SparkSession entry point
import pandas as pd
from pyspark.sql import SparkSession

# initializing the PySpark session
spark = SparkSession.builder.getOrCreate()

# creating a pandas DataFrame
df = pd.DataFrame({
  'Cardinal': [1, 2, 3],
  'Ordinal': ['First', 'Second', 'Third']
})

# converting the pandas DataFrame into a PySpark DataFrame
df = spark.createDataFrame(df)

# printing the first two rows
df.show(2)

Output:

+--------+-------+
|Cardinal|Ordinal|
+--------+-------+
|       1|  First|
|       2| Second|
+--------+-------+
only showing top 2 rows

If you would like to keep using the Pandas DataFrame afterwards, store the converted PySpark DataFrame in a separate variable instead of overwriting it.

Converting PySpark DataFrame into a Pandas DataFrame

Now, we will convert a PySpark DataFrame into a Pandas DataFrame. The steps are the same as before, except that this time we call the toPandas() method on the PySpark DataFrame.

Syntax to use toPandas() method:

spark_DataFrame.toPandas()

Example:

# importing SparkSession and Row from the PySpark SQL module
from pyspark.sql import SparkSession, Row

# initializing the PySpark session
spark = SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame
spark_df = spark.createDataFrame([
  Row(Cardinal=1, Ordinal='First'),
  Row(Cardinal=2, Ordinal='Second'),
  Row(Cardinal=3, Ordinal='Third')
])

# converting spark_df into a pandas DataFrame
pandas_df = spark_df.toPandas()

pandas_df.head()

Output:


Now we will measure the time this conversion takes without PyArrow enabled.

%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# creating a session in PySpark
spark = SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame from a 10x10 grid of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))
spark_df.toPandas()

Output:

3.17 s

Now let’s enable PyArrow and measure the same conversion again.

%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# creating a session in PySpark
spark = SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame from a 10x10 grid of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))

# enabling PyArrow-based conversion
# (in Spark 3.x this key is renamed 'spark.sql.execution.arrow.pyspark.enabled')
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
spark_df.toPandas()

Output:

460 ms

Here we can see that the time required to convert the PySpark DataFrame to a Pandas DataFrame drops drastically when the PyArrow-optimized conversion is enabled.

