
Split DataFrame by Row Index in PySpark

Last Updated : 28 Dec, 2022

In this article, we are going to learn how to split a PySpark data frame by row index in Python.

In data science, we often deal with large volumes of data, and many modules, functions, and methods are available to process it. In this article, we are going to process data by splitting a data frame by row index using PySpark in Python.

Modules Required:

PySpark: PySpark is the Python API for Apache Spark; it lets you write Spark applications in Python and offers a DataFrame API that will feel familiar to Pandas users. This module can be installed through the following command:

pip install pyspark
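
To verify the installation, you can print the installed version:

import pyspark
print(pyspark.__version__)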

Stepwise Implementation:

Step 1: First of all, import SparkSession, Window, monotonically_increasing_id, and ntile. SparkSession is used to create the session, while Window operates on a group of rows and returns a single value for every input row. monotonically_increasing_id is a function that generates a column of monotonically increasing 64-bit integers, and ntile is a window function that returns the ntile group id (from 1 to n inclusive) within an ordered window partition.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile

Step 2: Now, create a Spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()
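
Before moving on, here is a quick look at how monotonically_increasing_id and ntile behave together, using a tiny hypothetical data frame (a minimal sketch; the names demo, letter, _id, and bucket are illustrative, not part of the article's examples):

demo = spark_session.createDataFrame([(x,) for x in 'abcd'], ('letter',))
demo = demo.withColumn('_id', monotonically_increasing_id())
demo.withColumn('bucket', ntile(2).over(Window.orderBy(demo._id))).show()

The generated ids increase with row order but are not guaranteed to be consecutive; ntile(2) nevertheless assigns the buckets 1, 1, 2, 2 in row order.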

Step 3: Then, either create the data frame from a list of strings or read it from a CSV file.

values = [#Declare the list of strings]
data_frame = spark_session.createDataFrame(values, ('value',))

or 

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
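
After loading, it can help to confirm what Spark inferred from the file (an optional check; the columns printed depend on your CSV):

data_frame.printSchema()
data_frame.show(5)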

Step 4: Later on, define a function that, when called, splits the PySpark data frame by row index.

def split_by_row_index(df, number_of_partitions=#Number_of_partitions):

Step 4.1: Further, add a _row_id column that captures the row order of the data frame using the monotonically_increasing_id function. (The generated ids are monotonically increasing but not guaranteed to be consecutive.)

    updated_df = df.withColumn('_row_id', monotonically_increasing_id())

Step 4.2: Moreover, assign each row to a partition group using the ntile function, which returns a group id (from 1 to number_of_partitions) based on each row's position in the window ordered by _row_id.

    updated_df = updated_df.withColumn('_partition', ntile(number_of_partitions).over(Window.orderBy(updated_df._row_id))) 

Step 4.3: Next, return a list of data frames, one per partition group, by filtering on the _partition column and dropping the helper columns.

    return [updated_df.filter(updated_df._partition == i+1).drop('_row_id', '_partition') for i in range(number_of_partitions)]

Step 5: Finally, call the function and collect each resulting split of the PySpark data frame.

[i.collect() for i in split_by_row_index(data_frame)]
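
If you would rather keep the splits as data frames instead of collecting them to the driver, a usage sketch (reusing the function and data_frame defined above) is:

parts = split_by_row_index(data_frame, number_of_partitions=4)
for i, part in enumerate(parts, start=1):
    print('split', i, 'has', part.count(), 'rows')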

Example 1:

In this example, we have created the data frame from a list of strings and then split it by row index, grouping the rows into partitions and assigning a group id to each one.

Python3

# Python program to split data frame by row index
  
# Import SparkSession, Window,
# monotonically_increasing_id, and ntile
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Define the list from which to create the data frame
values = [(str(i),) for i in range(20)]
  
# Create the data frame from the list defined
data_frame = spark_session.createDataFrame(values, ('value',))
  
# Define a function that splits a PySpark
# data frame by row index
  
  
def split_by_row_index(df, number_of_partitions=4):
  
    # Add a _row_id column that captures the row order of the data frame
    updated_df = df.withColumn('_row_id',
                               monotonically_increasing_id())
  
    # Assign each row a group id (1 to number_of_partitions)
    # using the ntile window function over the ordered window
    updated_df = updated_df.withColumn('_partition', ntile(
        number_of_partitions).over(Window.orderBy(updated_df._row_id)))
  
    # Return one data frame per group, dropping the helper columns
    return [updated_df.filter(updated_df._partition == i+1).drop(
      '_row_id', '_partition') for i in range(number_of_partitions)]
  
  
# Call the function and collect each split of the data frame
[i.collect() for i in split_by_row_index(data_frame)]


Output:

[[Row(value='0'),
  Row(value='1'),
  Row(value='2'),
  Row(value='3'),
  Row(value='4')],
 [Row(value='5'),
  Row(value='6'),
  Row(value='7'),
  Row(value='8'),
  Row(value='9')],
 [Row(value='10'),
  Row(value='11'),
  Row(value='12'),
  Row(value='13'),
  Row(value='14')],
 [Row(value='15'),
  Row(value='16'),
  Row(value='17'),
  Row(value='18'),
  Row(value='19')]]

Example 2:

In this example, we have read a 5×5 CSV file and then split it by row index, grouping the rows into partitions and assigning a group id to each one.

Python3

# Python program to split data frame by row index
  
# Import SparkSession, Window,
# monotonically_increasing_id, and ntile
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Read the CSV file
data_frame = spark_session.read.csv(
    '/content/class_data.csv', sep=',', inferSchema=True,
    header=True)
  
# Define a function that splits a PySpark
# data frame by row index
  
  
def split_by_row_index(df, number_of_partitions=4):
  
    # Add a _row_id column that captures the row order of the data frame
    updated_df = df.withColumn('_row_id', monotonically_increasing_id())
  
    # Assign each row a group id (1 to number_of_partitions)
    # using the ntile window function over the ordered window
    updated_df = updated_df.withColumn('_partition', ntile(
        number_of_partitions).over(Window.orderBy(updated_df._row_id)))
  
    # Return one data frame per group, dropping the helper columns
    return [updated_df.filter(updated_df._partition == i+1).drop(
      '_row_id', '_partition') for i in range(number_of_partitions)]
  
  
# Call the function and collect each split of the data frame
[i.collect() for i in split_by_row_index(data_frame)]


Output:

[[Row(name='Arun', subject='Maths', class=10, fees=12000, discount=400),
  Row(name='Aniket', subject='Social Science', class=11, fees=15000, discount=600)],
 [Row(name='Ishita', subject='English', class=9, fees=9000, discount=0)],
 [Row(name='Pranjal', subject='Science', class=12, fees=18000, discount=1000)],
 [Row(name='Vinayak', subject='Computer', class=12, fees=18000, discount=500)]]
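
As a point of comparison, Spark also provides the built-in DataFrame.randomSplit(weights, seed) method, but it splits rows by random sampling rather than by position, so the ntile-based approach above is needed whenever the splits must follow row order:

train, test = data_frame.randomSplit([0.8, 0.2], seed=42)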

