
Optimize Conversion between PySpark and Pandas DataFrames

PySpark and Pandas are two open-source Python libraries used for data analysis and data handling.

Conversion between PySpark and Pandas DataFrames

In this article, we will look at how to convert a PySpark DataFrame into a Pandas DataFrame and vice versa. Both conversions are straightforward in PySpark.



Converting Pandas DataFrame into a PySpark DataFrame

Here, we’ll convert a Pandas DataFrame into a PySpark DataFrame. First, we import the PySpark and Pandas libraries and start a Spark session. Next, we create a Pandas DataFrame. To convert it, we call the createDataFrame() method, passing the Pandas DataFrame as its argument, and store the result in the same variable that held the Pandas DataFrame. These steps turn the Pandas DataFrame into a PySpark DataFrame.

Example:

# importing pandas and the PySpark SparkSession entry point
import pandas as pd
from pyspark.sql import SparkSession

# initializing the PySpark session
spark = SparkSession.builder.getOrCreate()

# creating a pandas DataFrame
df = pd.DataFrame({
  'Cardinal': [1, 2, 3],
  'Ordinal': ['First', 'Second', 'Third']
})

# converting the pandas DataFrame into a PySpark DataFrame
df = spark.createDataFrame(df)

# printing the first two rows
df.show(2)

Output:

+--------+-------+
|Cardinal|Ordinal|
+--------+-------+
|       1|  First|
|       2| Second|
+--------+-------+
only showing top 2 rows

If you would like to keep using the Pandas DataFrame afterwards, store the converted PySpark DataFrame in a separate variable instead of overwriting it.

Converting PySpark DataFrame into a Pandas DataFrame

Now, we will convert a PySpark DataFrame into a Pandas DataFrame. The steps are the same as before, except that this time we call the toPandas() method on the PySpark DataFrame.

Syntax to use toPandas() method:

spark_DataFrame.toPandas()

Example:

# importing SparkSession and Row from the PySpark SQL module
from pyspark.sql import SparkSession, Row

# initializing the PySpark session
spark = SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame
spark_df = spark.createDataFrame([
  Row(Cardinal=1, Ordinal='First'),
  Row(Cardinal=2, Ordinal='Second'),
  Row(Cardinal=3, Ordinal='Third')
])

# converting spark_df into a pandas DataFrame
pandas_df = spark_df.toPandas()

pandas_df.head()

Output:


Now we will measure the time this conversion takes without PyArrow enabled.

%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# creating a session in PySpark
spark = SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame from a 10x10 grid of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))
spark_df.toPandas()

Output:

3.17 s

Now let’s enable PyArrow and measure the same conversion again.

%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# creating a session in PySpark
spark = SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame from a 10x10 grid of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))

# enabling PyArrow-based conversion
# (in Spark 3.x this key is renamed 'spark.sql.execution.arrow.pyspark.enabled')
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
spark_df.toPandas()

Output:

460 ms

Here we can see that the time required to convert the PySpark DataFrame to a Pandas DataFrame drops drastically when the PyArrow-optimized conversion is enabled.

