Converting a PySpark DataFrame Column to a Python List

  • Last Updated : 18 Jul, 2021

In this article, we will discuss how to convert a PySpark DataFrame column to a Python list.

Creating a DataFrame for demonstration:

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of students data
data = [["1", "sravan", "vignan", 67, 89],
        ["2", "ojaswi", "vvit", 78, 89],
        ["3", "rohith", "vvit", 100, 80],
        ["4", "sridevi", "vignan", 78, 80],
        ["1", "sravan", "vignan", 89, 98],
        ["5", "gnanesh", "iit", 94, 98]]
  
# specify column names
columns = ['student ID', 'student NAME',
           'college', 'subject1', 'subject2']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# display dataframe
dataframe.show()

Output:

+----------+------------+-------+--------+--------+
|student ID|student NAME|college|subject1|subject2|
+----------+------------+-------+--------+--------+
|         1|      sravan| vignan|      67|      89|
|         2|      ojaswi|   vvit|      78|      89|
|         3|      rohith|   vvit|     100|      80|
|         4|     sridevi| vignan|      78|      80|
|         1|      sravan| vignan|      89|      98|
|         5|     gnanesh|    iit|      94|      98|
+----------+------------+-------+--------+--------+

Method 1: Using flatMap()

This method selects the column, drops down to the underlying rdd, and flattens each Row into its values, producing a single Python list.

Syntax: dataframe.select('Column_Name').rdd.flatMap(lambda x: x).collect()

where,

  • dataframe is the pyspark dataframe
  • Column_Name is the column to be converted into a list
  • flatMap() is an rdd method that takes a lambda expression as a parameter and flattens each Row into its values
  • collect() returns the results to the driver as a list

Example 1: Python code to convert a particular column to a list using flatMap

Python3

# convert student NAME to list using
# flatMap
print(dataframe.select('student NAME').
      rdd.flatMap(lambda x: x).collect())

# convert student ID to list using
# flatMap
print(dataframe.select('student ID').
      rdd.flatMap(lambda x: x).collect())

Output:

['sravan', 'ojaswi', 'rohith', 'sridevi', 'sravan', 'gnanesh']

['1', '2', '3', '4', '1', '5']
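What flatMap(lambda x: x) is doing here can be mimicked in plain Python: each Row collected from a single-column select behaves like a one-element tuple, and flattening those tuples yields the value list. A minimal sketch with hypothetical sample data (no SparkSession required):

```python
from itertools import chain

# each single-column Row behaves like a 1-tuple
rows = [('sravan',), ('ojaswi',), ('rohith',)]

# equivalent of rdd.flatMap(lambda x: x).collect()
names = list(chain.from_iterable(rows))
print(names)  # ['sravan', 'ojaswi', 'rohith']
```

This is also why flatMap handles the multi-column case below without extra work: each multi-field Row is flattened into all of its values.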



Example 2: Convert multiple columns to a list.

Python3

# convert multiple columns to a single
# flattened list using flatMap
print(dataframe.select(['student NAME',
                        'student ID',
                        'college']).
      rdd.flatMap(lambda x: x).collect())

Output:

['sravan', '1', 'vignan', 'ojaswi', '2', 'vvit', 'rohith', '3', 'vvit', 'sridevi', '4', 'vignan', 'sravan', '1', 'vignan', 'gnanesh', '5', 'iit']

Method 2: Using map()

map() applies a function to every Row of the selected column's rdd; taking element 0 of each Row yields the column values as a list.

Syntax: dataframe.select('Column_Name').rdd.map(lambda x: x[0]).collect()

where,

  • dataframe is the pyspark dataframe
  • Column_Name is the column to be converted into a list
  • map() is an rdd method that takes a lambda expression as a parameter and extracts the value from each Row
  • collect() returns the results to the driver as a list

Example: Python code to convert pyspark dataframe column to list using the map function.

Python3

# convert student NAME to list using map
print(dataframe.select('student NAME').
      rdd.map(lambda x: x[0]).collect())

# convert student ID to list using map
print(dataframe.select('student ID').
      rdd.map(lambda x: x[0]).collect())

# convert college to list using map
print(dataframe.select('college').
      rdd.map(lambda x: x[0]).collect())

Output:

['sravan', 'ojaswi', 'rohith', 'sridevi', 'sravan', 'gnanesh']

['1', '2', '3', '4', '1', '5']

['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']
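In plain Python, map(lambda x: x[0]) corresponds to taking element 0 of every tuple, which is also why the comprehension in Method 3 below produces the same result. A small sketch with hypothetical data:

```python
# each Row from a single-column select acts like a 1-tuple
rows = [('1',), ('2',), ('3',)]

# equivalent of rdd.map(lambda x: x[0]).collect()
ids = [row[0] for row in rows]
print(ids)  # ['1', '2', '3']
```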

Method 3: Using collect()

collect() returns every Row of the selected column to the driver; a list comprehension then pulls the value out of each Row to build the list.

Syntax: [data[0] for data in dataframe.select('column_name').collect()]

Where,

  • dataframe is the pyspark dataframe
  • data iterates over the Rows of the selected column
  • column_name is the column in the dataframe

Example: Python code to convert dataframe columns to list using collect() method

Python3

# display college column in
# the list format using comprehension
print([data[0] for data in dataframe.
       select('college').collect()])

# display student ID column in the
# list format using comprehension
print([data[0] for data in dataframe.
       select('student ID').collect()])

# display subject1 column in the list
# format using comprehension
print([data[0] for data in dataframe.
       select('subject1').collect()])

# display subject2 column in the
# list format using comprehension
print([data[0] for data in dataframe.
       select('subject2').collect()])

Output:

['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']
['1', '2', '3', '4', '1', '5']
[67, 78, 100, 78, 89, 94]
[89, 89, 80, 80, 98, 98]

Method 4: Using toLocalIterator()

toLocalIterator() returns an iterator that fetches the rows partition by partition, so only a fraction of the data has to sit in driver memory at once; a comprehension over it builds the list.

Syntax: [data[0] for data in dataframe.select('column_name').toLocalIterator()]

Where,

  • dataframe is the pyspark dataframe
  • data iterates over the Rows of the selected column
  • column_name is the column in the dataframe

Example: Convert pyspark dataframe columns to list using toLocalIterator() method

Python3

# display college column in the list
# format using comprehension
print([data[0] for data in dataframe.
       select('college').toLocalIterator()])

# display student ID column in the
# list format using comprehension
print([data[0] for data in dataframe.
       select('student ID').toLocalIterator()])

# display subject1 column in the list
# format using comprehension
print([data[0] for data in dataframe.
       select('subject1').toLocalIterator()])

# display subject2 column in the
# list format using comprehension
print([data[0] for data in dataframe.
       select('subject2').toLocalIterator()])

Output:

['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']
['1', '2', '3', '4', '1', '5']
[67, 78, 100, 78, 89, 94]
[89, 89, 80, 80, 98, 98]
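The memory-saving behaviour of toLocalIterator() can be pictured with an ordinary generator: values are produced one at a time rather than materialized all at once the way collect() does. A plain-Python sketch with hypothetical data:

```python
def local_iterator(rows):
    """Yield rows one at a time, loosely mimicking DataFrame.toLocalIterator()."""
    for row in rows:
        yield row

rows = [('vignan',), ('vvit',), ('iit',)]
colleges = [row[0] for row in local_iterator(rows)]
print(colleges)  # ['vignan', 'vvit', 'iit']
```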

Method 5: Using toPandas()

toPandas() converts the selected column to a pandas DataFrame; wrapping the resulting column (a pandas Series) in list() gives a Python list.

Syntax: list(dataframe.select('column_name').toPandas()['column_name'])

Where,

  • toPandas() converts the selected column(s) to a pandas DataFrame
  • column_name is the column in the pyspark dataframe

Example: Convert pyspark dataframe columns to list using toPandas() method

Python3

# display college column in
# the list format using toPandas
print(list(dataframe.select('college').
           toPandas()['college']))

# display student NAME column in
# the list format using toPandas
print(list(dataframe.select('student NAME').
           toPandas()['student NAME']))

# display subject1 column in
# the list format using toPandas
print(list(dataframe.select('subject1').
           toPandas()['subject1']))

# display subject2 column
# in the list format using toPandas
print(list(dataframe.select('subject2').
           toPandas()['subject2']))

Output:

['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']

['sravan', 'ojaswi', 'rohith', 'sridevi', 'sravan', 'gnanesh']

[67, 78, 100, 78, 89, 94]

[89, 89, 80, 80, 98, 98]
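Whichever method you pick, the result is a plain Python list, so ordinary post-processing applies. For example, the student ID list collected above contains a duplicate '1'; a set() removes it (a small sketch on the already-collected list):

```python
# list collected from the 'student ID' column above
ids = ['1', '2', '3', '4', '1', '5']

# drop duplicates, then sort for a stable order
unique_ids = sorted(set(ids))
print(unique_ids)  # ['1', '2', '3', '4', '5']
```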
