Optimize Conversion between PySpark and Pandas DataFrames
Last Updated: 01 Dec, 2022
PySpark and Pandas are two open-source libraries used for data analysis and data handling in Python. PySpark is the Python API for Apache Spark, a distributed engine for processing large datasets across a cluster, while Pandas works with in-memory data on a single machine.
Conversion between PySpark and Pandas DataFrames
In this article, we are going to look at how to convert a PySpark DataFrame into a Pandas DataFrame and vice versa. Both conversions can be done easily in PySpark.
Converting Pandas DataFrame into a PySpark DataFrame
Here, we'll be converting a Pandas DataFrame into a PySpark DataFrame. First of all, we'll import the PySpark and Pandas libraries and start a Spark session. Then we'll create a Pandas DataFrame and convert it to a PySpark DataFrame by passing it to the createDataFrame() method, storing the result in the same variable that held the Pandas DataFrame. These steps convert the Pandas DataFrame into a PySpark DataFrame.
Example:
Python3
import pandas as pd
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# Create a Pandas DataFrame
df = pd.DataFrame({
    'Cardinal': [1, 2, 3],
    'Ordinal': ['First', 'Second', 'Third']
})

# Convert it to a PySpark DataFrame, reusing the same variable
df = spark.createDataFrame(df)
df.show(2)
Output:
If you would like to keep using the Pandas DataFrame afterwards, store the converted PySpark DataFrame in a different variable instead of overwriting it.
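For example, a minimal sketch that keeps both objects around (the variable names pandas_df and spark_df are just illustrative, and the printed class paths may differ slightly between PySpark versions):
Python3
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep the Pandas DataFrame in its own variable...
pandas_df = pd.DataFrame({
    'Cardinal': [1, 2, 3],
    'Ordinal': ['First', 'Second', 'Third']
})

# ...and store the converted PySpark DataFrame separately
spark_df = spark.createDataFrame(pandas_df)

print(type(pandas_df))  # <class 'pandas.core.frame.DataFrame'>
print(type(spark_df))   # <class 'pyspark.sql.dataframe.DataFrame'>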
Converting PySpark DataFrame into a Pandas DataFrame
Now, we will be converting a PySpark DataFrame into a Pandas DataFrame. The steps are the same as before, but this time we call the toPandas() method on the PySpark DataFrame.
Syntax to use toPandas() method:
spark_DataFrame.toPandas()
Example:
Python3
from pyspark.sql import Row, SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# Create a PySpark DataFrame from Row objects
spark_df = spark.createDataFrame([
    Row(Cardinal=1, Ordinal='First'),
    Row(Cardinal=2, Ordinal='Second'),
    Row(Cardinal=3, Ordinal='Third')
])

# Convert it to a Pandas DataFrame
pandas_df = spark_df.toPandas()
pandas_df.head()
Output:
Now we will check the time required to do the above conversion.
Python3
%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a 10 x 10 DataFrame of random integers and convert it to Pandas
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=100).reshape(10, 10)))
spark_df.toPandas()
Output:
3.17 s
Now let's enable PyArrow and see how long the same conversion takes. Arrow transfers the data between the JVM and Python in a columnar format instead of serializing it row by row, which makes toPandas() much faster.
Python3
%%time
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=100).reshape(10, 10)))

# Enable Arrow-based columnar data transfer before calling toPandas()
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
spark_df.toPandas()
Output:
460 ms
Here we can see that the time required to convert the PySpark DataFrame to a Pandas DataFrame drops drastically once the Arrow-based optimization is enabled.
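Note that spark.sql.execution.arrow.enabled is the older name of this setting; in Spark 3.x it is deprecated in favor of spark.sql.execution.arrow.pyspark.enabled. A minimal sketch of enabling the newer option, assuming Spark 3.x, along with the optional fallback setting:
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 3.x name of the Arrow optimization flag
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')

# Optionally fall back to the non-Arrow conversion path if Arrow fails
spark.conf.set('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true')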