
PySpark dataframe foreach to fill a list

Last Updated : 01 Mar, 2023

In this article, we are going to learn how to fill a Python list with the rows of a PySpark dataframe using the foreach() function.

PySpark is a powerful open-source library for working with large datasets in Python. It is designed for distributed computing and is commonly used for data manipulation and analysis tasks. By using parallel processing techniques, it allows users to process large amounts of data easily and efficiently.

The data frame structure is one of the key features of PySpark, and it makes it possible to manipulate and analyze data in a tabular format. A dataframe is similar to a traditional spreadsheet or SQL table and provides a variety of functions and methods for manipulating and analyzing data.

Dataframes and their importance in PySpark

In PySpark, a data frame is a distributed collection of data organized into rows and columns, similar to a spreadsheet or a SQL table, and it is an essential part of PySpark for data manipulation and analysis. Dataframes allow users to easily manipulate, filter, and transform data, and they provide a wide range of functions and methods for working with data. A key advantage of data frames is their ability to scale to large amounts of data: because they are distributed across a cluster of machines, they can handle very large datasets without hitting the memory constraints of a single machine. This makes them ideal for working with big data and for running complex queries and operations on large datasets. A small filter example is sketched below.
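
As a quick illustration of the kind of manipulation dataframes support, here is a minimal sketch of filtering a dataframe. The column names and values here are made up purely for this example:

Python3

from pyspark.sql import SparkSession

# Create a SparkSession (the app name is just an example)
spark = SparkSession.builder.appName(
    "Dataframe example").getOrCreate()

# A small, made-up dataframe of people and ages
people = spark.createDataFrame(
    [("Ann", 28), ("Ben", 41)], ["name", "age"])

# filter() returns a new dataframe; the original
# dataframe is left unchanged
adults_over_30 = people.filter(people.age > 30)
adults_over_30.show()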

Creating a PySpark data frame from a list

In this section, we are going to create a PySpark data frame from a list of tuples by defining its schema with StructType() and then calling the createDataFrame() function. Here are the steps to create a PySpark data frame from a list.

Python3
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession
 
# create a SparkSession
spark = SparkSession.builder.appName(
    "PySpark fill list").getOrCreate()
 
# Create a list of tuples
data = [(1, "John"), (2, "Mike"),
        (3, "Sara")]
 
# Define the schema of the dataframe
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType())
])
 
# Create a dataframe from the list
df = spark.createDataFrame(data, schema)
 
# Show the dataframe
df.show()


Output :

+---+----+
| id|name|
+---+----+
|  1|John|
|  2|Mike|
|  3|Sara|
+---+----+

Using foreach to fill a list from Pyspark data frame

foreach() is used to iterate over the rows in a PySpark data frame, and here we use it to append the data from each row to a list. Note that foreach() is an action, and the function passed to it is executed on the worker nodes, not on the driver. Because of this, mutating a driver-side object (such as appending to a local list) inside foreach() is unreliable: on a real cluster, each executor works on its own copy of the closure, so the driver's list is not filled. That is why the example below uses collect() to actually bring the rows back to the driver. Here are the steps for using the foreach() function together with collect() to fill a list with data from a PySpark data frame:

Python3
# Import the SparkSession class from the pyspark.sql module
from pyspark.sql import SparkSession

# Create a SparkSession with the specified app name
spark = SparkSession.builder.appName(
    'Example').getOrCreate()

# Create a DataFrame with three rows,
# containing the names and ages of three people
df = spark.createDataFrame(
    [('Alice', 25), ('Bob', 30),
     ('Charlie', 35)], ['name', 'age'])

# Initialize an empty list to store the results
result = []

# foreach() runs this lambda on the executors.
# In local mode the appends may appear to work, but on
# a cluster each executor mutates its own copy of the
# closure, so the driver's list is NOT filled reliably.
df.foreach(lambda row: result.append((row.name,
                                      row.age)))

# collect() is the reliable way to bring the rows of
# the DataFrame back to the driver as a list
result = df.collect()

# Print the resulting list of rows
print(result)


Output :

[Row(name='Alice', age=25), Row(name='Bob', age=30), Row(name='Charlie', age=35)]
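
If you want to fill the list one row at a time without pulling the whole dataframe into driver memory at once, df.toLocalIterator() is a safer alternative to mutating a list inside foreach(). Here is a minimal sketch that reuses the df defined above:

Python3

# toLocalIterator() streams the rows to the driver one
# partition at a time, so the list is built on the
# driver rather than on the executors
result = []
for row in df.toLocalIterator():
    result.append((row.name, row.age))

print(result)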


