How to split rows of a Spark RDD by Delimiter

Last Updated : 16 Oct, 2023

Apache Spark is a powerful distributed computing system for processing huge datasets. Its fundamental abstraction, the Resilient Distributed Dataset (RDD), represents an immutable distributed collection of objects. Splitting the rows of an RDD based on a delimiter is a typical Spark task, and it is useful when parsing structured data such as CSV or TSV files. In this article, we will learn how to split the rows of a Spark RDD based on a delimiter in Python.

Splitting Rows of a Spark RDD by Delimiter

Resilient Distributed Datasets (RDDs) are the core abstraction Apache Spark uses to represent a distributed group of immutable objects that can be processed in parallel across a cluster of machines. Splitting the rows of an RDD based on a delimiter is a typical Spark task. Because RDD transformations are distributed, Spark can apply the split to large datasets in parallel across the cluster.

To divide the rows of an RDD by a delimiter, we can use the map transformation to apply a function to each element of the RDD. The function should split each row into an array of values based on the delimiter and return that array as the new element, as shown in the sketch below.
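
For instance, a minimal sketch of this pattern might look like the following (the variable names lines and split_lines are illustrative, and a SparkSession named spark is assumed to already exist, as created in the steps below):

lines = spark.sparkContext.parallelize(["a,b,c", "d,e"])
# split each row on the comma delimiter; every element becomes a list of values
split_lines = lines.map(lambda line: line.split(","))
print(split_lines.collect())  # [['a', 'b', 'c'], ['d', 'e']]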

Steps to split Spark RDD Rows by Delimiter in Python

Let us see a step-by-step process of how to divide rows of an RDD when a delimiter is provided.

Step 1: Import the required Modules

First, we import the SparkSession class from the PySpark module.

from pyspark.sql import SparkSession

Step 2: Create a Spark Session

Then, create a Spark session. SparkSession.builder is a builder object used to configure and create the SparkSession, and appName() sets the application name. The getOrCreate() method returns an existing SparkSession if one is already running; otherwise, it creates a new one.

spark = SparkSession.builder.appName("SplitRowsByDelimiter").getOrCreate()

Step 3: Create an RDD

Before we divide an RDD’s rows, we must first make an RDD of strings. We can accomplish this by reading data from a file or by using the parallelize method to create an RDD from a list of strings.

rdd = spark.sparkContext.parallelize(["apple,orange,banana", "carrot,tomato,potato"])

Step 4: Define a Split Rows function

Once we have an RDD of strings, we must define a function to divide each row based on a delimiter into an array of values. For instance, we can define a function that uses the split method to divide each row by a comma.

def split_row(row):
    return row.split(",")

Step 5: Split all the Rows

After creating the function, we can use the map transformation to apply it to each row of the RDD. The map transformation takes a function as a parameter, applies it to every element of the RDD, and returns a new RDD containing the transformed elements.

split_rdd = rdd.map(split_row)

Step 6: Collect the result as a list

After applying the function to each row using the map transformation, we use the collect action to gather the result as a list. The collect action brings all of the RDD's elements back to the driver program, where they can be processed like any other Python list.

result = split_rdd.collect()
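
To inspect the collected rows, we can simply loop over the resulting list and print each entry, as the full examples below also do:

for row in result:
    print(row)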

Examples of Splitting the Rows of a Spark RDD by Delimiter

Let us see a few examples of splitting the rows of a Spark RDD based on different delimiters.

Example 1: Splitting Rows by comma

In this example, let us say we have an RDD of strings where each row contains a list of values separated by commas. We use the parallelize method to generate an RDD from comma-separated strings. We then define a function, split_row, that uses the split method to split each row by the comma delimiter, and apply it to every row with the map transformation. Finally, we use the collect action to gather the result: collect() pulls the RDD’s entire data set back to the driver and returns it as a list. We then loop through the list and print each entry to the console.

Python3




# Import required modules
from pyspark.sql import SparkSession
 
# Create a SparkSession
spark = SparkSession.builder.appName("SplitRowsByDelimiter").getOrCreate()
 
# Create an RDD with sample data
rdd = spark.sparkContext.parallelize(["apple,orange,banana", "carrot,tomato,potato"])
 
# define a function to split each row by a comma
def split_row(row):
    return row.split(",")
 
# apply the function to each row using the map transformation
split_rdd = rdd.map(split_row)
 
# collect the result as a list of arrays
result = split_rdd.collect()
 
# print the result
for row in result:
    print(row)


Output:

['apple', 'orange', 'banana']
['carrot', 'tomato', 'potato']

Example 2: Splitting Rows by tab delimiter

In this example, let us say we have an RDD of strings where each row contains a list of values separated by tabs. We use the parallelize method to generate an RDD from tab-separated strings. To split the rows by the tab delimiter, each row is divided on the ‘\t‘ character into an array of values, and that array is returned as the new element.

We will apply a transformation to each row of the RDD using the map() method. The map() method takes as its parameter a lambda function that specifies the transformation to apply to each element. In this instance, we use the split() method to divide each row by the delimiter.

Python3




# Import required modules
from pyspark.sql import SparkSession
 
# Create a SparkSession
spark = SparkSession.builder.appName(
    "SplitRowsByDelimiter").getOrCreate()
 
# Create an RDD with sample data
rdd = spark.sparkContext.parallelize(["foo\tbar\tbaz",
                                      "hello\tworld"])
 
# Define the delimiter
delimiter = "\t"
 
# Split the rows by the delimiter
split_data = rdd.map(lambda row: row.split(delimiter))
 
# Print the resulting RDD
for row in split_data.collect():
    print(row)


Output:

['foo', 'bar', 'baz']
['hello', 'world']

Example 3: Splitting Rows by space delimiter

In this example, let us say we have an RDD of strings where each row contains a list of values separated by spaces. We use the split method with no arguments, which splits on whitespace, to divide each row into an array of values and return that array as the new element.

Python3




# Import required modules
from pyspark.sql import SparkSession
 
# Create a SparkSession
spark = SparkSession.builder.appName(
    "SplitRowsByDelimiter").getOrCreate()
 
# Create an RDD with sample data
rdd = spark.sparkContext.parallelize(["Geeks for Geeks",
                                      "hello world"])
 
# define a function to split each row by a space delimiter
def split_row(row):
    return row.split()
 
# apply the function to each row using the map transformation
split_rdd = rdd.map(split_row)
 
# collect the result as a list of arrays
result = split_rdd.collect()
 
# print the result
for row in result:
    print(row)


Output:

['Geeks', 'for', 'Geeks']
['hello', 'world']

