
Split DataFrame by Row Index in PySpark

Last Updated : 28 Dec, 2022

In this article, we are going to learn how to split a PySpark data frame by row index in Python.

In data science, we often deal with large volumes of data, and many modules, functions, and methods are available to process it. In this article, we are going to process data by splitting a data frame by row index using PySpark in Python.

Modules Required:

PySpark: PySpark is the Python API for Apache Spark; it lets you write Spark applications in Python and offers a DataFrame API that will feel familiar to Pandas users. This module can be installed through the following command:

pip install pyspark
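
To verify the installation, you can print the installed version:

import pyspark
print(pyspark.__version__)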

Stepwise Implementation:

Step 1: First of all, import SparkSession, Window, monotonically_increasing_id, and ntile. SparkSession is used to create the session, while Window operates on a group of rows and returns a single value for every input row. monotonically_increasing_id is a function that generates a column of monotonically increasing 64-bit integers, and ntile is a window function that returns the ntile group id (from 1 to n inclusive) within an ordered window partition.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile

Step 2: Now, create a Spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()
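
Before moving on, here is a quick look at how monotonically_increasing_id and ntile behave together, using a tiny hypothetical data frame (a minimal sketch; the names demo, letter, _id, and bucket are illustrative, not part of the article's examples):

demo = spark_session.createDataFrame([(x,) for x in 'abcd'], ('letter',))
demo = demo.withColumn('_id', monotonically_increasing_id())
demo.withColumn('bucket', ntile(2).over(Window.orderBy(demo._id))).show()

The generated ids increase with row order but are not guaranteed to be consecutive; ntile(2) nevertheless assigns the buckets 1, 1, 2, 2 in row order.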

Step 3: Then, either create the data frame from a list of strings or read it from a CSV file.

values = [#Declare the list of strings]
data_frame = spark_session.createDataFrame(values, ('value',))

or 

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
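
After loading, it can help to confirm what Spark inferred from the file (an optional check; the columns printed depend on your CSV):

data_frame.printSchema()
data_frame.show(5)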

Step 4: Later on, define a function that, when called, splits the PySpark data frame by row index.

def split_by_row_index(df, number_of_partitions=#Number_of_partitions):

Step 4.1: Further, add a _row_id column that captures the row order of the data frame using the monotonically_increasing_id function. (The generated ids are monotonically increasing but not guaranteed to be consecutive.)

    updated_df = df.withColumn('_row_id', monotonically_increasing_id())

Step 4.2: Moreover, assign each row to a partition group using the ntile function, which returns a group id (from 1 to number_of_partitions) based on each row's position in the window ordered by _row_id.

    updated_df = updated_df.withColumn('_partition', ntile(number_of_partitions).over(Window.orderBy(updated_df._row_id))) 

Step 4.3: Next, return a list of data frames, one per partition group, by filtering on the _partition column and dropping the helper columns.

    return [updated_df.filter(updated_df._partition == i+1).drop('_row_id', '_partition') for i in range(number_of_partitions)]

Step 5: Finally, call the function and collect each resulting split of the PySpark data frame.

[i.collect() for i in split_by_row_index(data_frame)]
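
If you would rather keep the splits as data frames instead of collecting them to the driver, a usage sketch (reusing the function and data_frame defined above) is:

parts = split_by_row_index(data_frame, number_of_partitions=4)
for i, part in enumerate(parts, start=1):
    print('split', i, 'has', part.count(), 'rows')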

Example 1:

In this example, we have created the data frame from a list of strings and then split it by row index, grouping the rows into partitions and assigning a group id to each one.

Python3

# Python program to split data frame by row index
  
# Import SparkSession, Window,
# monotonically_increasing_id, and ntile
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Define the list from which to create the data frame
values = [(str(i),) for i in range(20)]
  
# Create the data frame from the list defined
data_frame = spark_session.createDataFrame(values, ('value',))
  
# Define a function that splits a PySpark
# data frame by row index
  
  
def split_by_row_index(df, number_of_partitions=4):
  
    # Add a _row_id column that captures the row order of the data frame
    updated_df = df.withColumn('_row_id',
                               monotonically_increasing_id())
  
    # Assign each row a group id (1 to number_of_partitions)
    # using the ntile window function over the ordered window
    updated_df = updated_df.withColumn('_partition', ntile(
        number_of_partitions).over(Window.orderBy(updated_df._row_id)))
  
    # Return one data frame per group, dropping the helper columns
    return [updated_df.filter(updated_df._partition == i+1).drop(
      '_row_id', '_partition') for i in range(number_of_partitions)]
  
  
# Call the function and collect each split of the data frame
[i.collect() for i in split_by_row_index(data_frame)]


Output:

[[Row(value='0'),
  Row(value='1'),
  Row(value='2'),
  Row(value='3'),
  Row(value='4')],
 [Row(value='5'),
  Row(value='6'),
  Row(value='7'),
  Row(value='8'),
  Row(value='9')],
 [Row(value='10'),
  Row(value='11'),
  Row(value='12'),
  Row(value='13'),
  Row(value='14')],
 [Row(value='15'),
  Row(value='16'),
  Row(value='17'),
  Row(value='18'),
  Row(value='19')]]

Example 2:

In this example, we have read a 5×5 CSV file and then split it by row index, grouping the rows into partitions and assigning a group id to each one.

Python3

# Python program to split data frame by row index
  
# Import SparkSession, Window,
# monotonically_increasing_id, and ntile
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Read the CSV file
data_frame = spark_session.read.csv(
    '/content/class_data.csv', sep=',', inferSchema=True,
    header=True)
  
# Define a function that splits a PySpark
# data frame by row index
  
  
def split_by_row_index(df, number_of_partitions=4):
  
    # Add a _row_id column that captures the row order of the data frame
    updated_df = df.withColumn('_row_id', monotonically_increasing_id())
  
    # Assign each row a group id (1 to number_of_partitions)
    # using the ntile window function over the ordered window
    updated_df = updated_df.withColumn('_partition', ntile(
        number_of_partitions).over(Window.orderBy(updated_df._row_id)))
  
    # Return one data frame per group, dropping the helper columns
    return [updated_df.filter(updated_df._partition == i+1).drop(
      '_row_id', '_partition') for i in range(number_of_partitions)]
  
  
# Call the function and collect each split of the data frame
[i.collect() for i in split_by_row_index(data_frame)]


Output:

[[Row(name='Arun', subject='Maths', class=10, fees=12000, discount=400),
  Row(name='Aniket', subject='Social Science', class=11, fees=15000, discount=600)],
 [Row(name='Ishita', subject='English', class=9, fees=9000, discount=0)],
 [Row(name='Pranjal', subject='Science', class=12, fees=18000, discount=1000)],
 [Row(name='Vinayak', subject='Computer', class=12, fees=18000, discount=500)]]
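
As a point of comparison, Spark also provides the built-in DataFrame.randomSplit(weights, seed) method, but it splits rows by random sampling rather than by position, so the ntile-based approach above is needed whenever the splits must follow row order:

train, test = data_frame.randomSplit([0.8, 0.2], seed=42)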

