PySpark – Create dictionary from data in two columns
Last Updated: 03 Jan, 2022
In this article, we are going to see how to create a dictionary from the data in two columns of a PySpark DataFrame using Python.

Method 1: Using a dictionary comprehension

Here we create a DataFrame with two columns and then convert it into a dictionary using a dictionary comprehension over the collected rows.
Python
import pyspark
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

rows = [['John', 54],
        ['Adam', 65],
        ['Michael', 56],
        ['Kelsey', 37],
        ['Chris', 49],
        ['Jonathan', 28],
        ['Anthony', 26],
        ['Esther', 48],
        ['Rachel', 52],
        ['Joseph', 56],
        ['Richard', 49]]
columns = ['Name', 'Age']

df_pyspark = spark_session.createDataFrame(rows, columns)
df_pyspark.show()

# Collect the rows to the driver and build a dictionary
# mapping each name to the corresponding age.
result_dict = {row['Name']: row['Age']
               for row in df_pyspark.collect()}

print(result_dict['John'])
print(result_dict['Michael'])
print(result_dict['Adam'])
Output (after the table printed by df_pyspark.show()):
54
56
65
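The comprehension above only touches the Row objects that collect() returns to the driver. Since PySpark Rows support dict-style key access, the pattern can be sketched with plain dictionaries and no Spark session (the small row list here is an illustrative stand-in for the collected rows):

```python
# Stand-ins for the Row objects returned by df_pyspark.collect();
# PySpark Rows allow the same row['Name'] style of access.
collected = [
    {'Name': 'John', 'Age': 54},
    {'Name': 'Adam', 'Age': 65},
    {'Name': 'Michael', 'Age': 56},
]

# The same dictionary comprehension as in the PySpark snippet above.
result_dict = {row['Name']: row['Age'] for row in collected}
print(result_dict['John'])  # 54
```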
Method 2: Converting the PySpark DataFrame to pandas and using the to_dict() method

Here are the details of the to_dict() method:

Syntax: PandasDataFrame.to_dict(orient='dict')
Parameters:
- orient : str {'dict', 'list', 'series', 'split', 'records', 'index'}
  Determines the type of the values of the dictionary.
Return: a Python dictionary corresponding to the DataFrame.
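To see how the orient argument changes the shape of the result, here is a small pandas-only sketch (the two-row frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Adam'], 'Age': [54, 65]})

# 'dict' (the default): column -> {index -> value}
print(df.to_dict(orient='dict'))
# {'Name': {0: 'John', 1: 'Adam'}, 'Age': {0: 54, 1: 65}}

# 'list': column -> list of values
print(df.to_dict(orient='list'))
# {'Name': ['John', 'Adam'], 'Age': [54, 65]}

# 'records': one dictionary per row
print(df.to_dict(orient='records'))
# [{'Name': 'John', 'Age': 54}, {'Name': 'Adam', 'Age': 65}]
```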
Python
import pyspark
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

rows = [['John', 54],
        ['Adam', 65],
        ['Michael', 56],
        ['Kelsey', 37],
        ['Chris', 49],
        ['Jonathan', 28],
        ['Anthony', 26],
        ['Esther', 48],
        ['Rachel', 52],
        ['Joseph', 56],
        ['Richard', 49]]
columns = ['Name', 'Age']

df_pyspark = spark_session.createDataFrame(rows, columns)
df_pyspark.show()

# Convert to a pandas DataFrame, then build a dictionary of
# column name -> list of that column's values.
df_pandas = df_pyspark.toPandas()
result = df_pandas.to_dict(orient='list')
print(result)
Output (after the table printed by df_pyspark.show()):
{'Name': ['John', 'Adam', 'Michael', 'Kelsey', 'Chris', 'Jonathan', 'Anthony', 'Esther', 'Rachel', 'Joseph', 'Richard'], 'Age': [54, 65, 56, 37, 49, 28, 26, 48, 52, 56, 49]}
Method 3: Iterating over the columns of a pandas DataFrame

Iterate through the columns and build a dictionary whose keys are the column names and whose values are lists of each column's values.
For this, we first convert the PySpark DataFrame to a pandas DataFrame.
Python
import pyspark
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

rows = [['John', 54],
        ['Adam', 65],
        ['Michael', 56],
        ['Kelsey', 37],
        ['Chris', 49],
        ['Jonathan', 28],
        ['Anthony', 26],
        ['Esther', 48],
        ['Rachel', 52],
        ['Joseph', 56],
        ['Richard', 49]]
columns = ['Name', 'Age']

df_pyspark = spark_session.createDataFrame(rows, columns)
df_pyspark.show()

# Convert to pandas, then collect each column's values into a list.
result = {}
df_pandas = df_pyspark.toPandas()
for column in df_pandas.columns:
    result[column] = df_pandas[column].values.tolist()
print(result)
Output (after the table printed by df_pyspark.show()):
{'Name': ['John', 'Adam', 'Michael', 'Kelsey', 'Chris', 'Jonathan', 'Anthony', 'Esther', 'Rachel', 'Joseph', 'Richard'], 'Age': [54, 65, 56, 37, 49, 28, 26, 48, 52, 56, 49]}
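The explicit column loop can also be written as a dictionary comprehension. A pandas-only sketch (the small frame is an illustrative stand-in for the converted DataFrame; Series.tolist() gives the same result as .values.tolist()):

```python
import pandas as pd

df_pandas = pd.DataFrame({'Name': ['John', 'Adam', 'Michael'],
                          'Age': [54, 65, 56]})

# One comprehension instead of the explicit loop over columns.
result = {column: df_pandas[column].tolist()
          for column in df_pandas.columns}
print(result)
# {'Name': ['John', 'Adam', 'Michael'], 'Age': [54, 65, 56]}
```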