PySpark dataframe foreach to fill a list
Last Updated :
01 Mar, 2023
In this article, we are going to learn how to build a list of rows from a PySpark data frame using the foreach() method in Python.
PySpark is a powerful open-source library for working with large datasets in Python. It is designed for distributed computing and is commonly used for data manipulation and analysis tasks. By using parallel processing techniques, it allows users to process large amounts of data easily and efficiently.
The data frame is one of the key features of PySpark, making it possible to manipulate and analyze data in a tabular format. A data frame is similar to a traditional spreadsheet or SQL table and provides a variety of functions and methods for manipulating and analyzing data.
Dataframes and their importance in PySpark
In PySpark, a data frame is a distributed collection of data organized into rows and columns, similar to a spreadsheet or a SQL table, and it is an essential part of PySpark for data manipulation and analysis. Data frames allow users to easily manipulate, filter, and transform data, and they provide a wide range of functions and methods for working with it. A key advantage of data frames is their ability to scale: because a data frame is distributed across a cluster of machines, it can handle very large datasets without hitting the memory limits of a single machine. This makes it ideal for working with big data and for running complex queries and operations on large datasets.
Creating a Pyspark data frame with the list
Here we create a PySpark data frame from a list of tuples, defining its schema with StructType() and then building the data frame with the createDataFrame() function. The steps are shown below.
Python3
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession

# Start a Spark session
spark = SparkSession.builder.appName("PySpark fill list").getOrCreate()

# Sample data as a list of (id, name) tuples
data = [(1, "John"), (2, "Mike"), (3, "Sara")]

# Define the schema of the data frame
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType())
])

# Create the data frame and display it
df = spark.createDataFrame(data, schema)
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1|John|
|  2|Mike|
|  3|Sara|
+---+----+
Using foreach to fill a list from Pyspark data frame
foreach() is used to iterate over the rows of a PySpark data frame; here we use it to try to add the data from each row to a list. Note that foreach() is an action whose function is executed on the worker nodes, not on the driver. As a result, side effects such as appending to a driver-side list do not propagate back to the driver, so foreach() is not a reliable way to fill a list; collect() is the straightforward way to bring the rows back to the driver. The following example demonstrates both:
Python3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Example').getOrCreate()

# Create a sample data frame
df = spark.createDataFrame(
    [('Alice', 25), ('Bob', 30), ('Charlie', 35)],
    ['name', 'age'])

# foreach() runs this lambda on the worker processes, so appending
# to a driver-side list has no visible effect on the driver
result = []
df.foreach(lambda row: result.append((row.name, row.age)))

# collect() is the reliable way to fill a list on the driver
result = [(row.name, row.age) for row in df.collect()]
print(result)
Output:
[('Alice', 25), ('Bob', 30), ('Charlie', 35)]