
PySpark – Split dataframe by column value

Last Updated : 23 Jan, 2023

In Python, a PySpark data frame is a distributed collection of data grouped into named columns.

Various circumstances arise in which you need only particular rows of a data frame. For this, you can split the data frame according to a column value, using either the filter function or the where function. In this article, we discuss both ways to split a data frame by column value.

Ways to split Pyspark data frame by column value:

  • Using filter function
  • Using where function

Method 1: Using the filter function

The filter function filters rows from a data frame based on a given condition or SQL expression. Here we split the data frame by column value using filter: we apply the condition once with equal to, and once with not equal to, and display both resulting data frames.

Syntax: data_frame.filter(condition)

Example:

In this example, we have read a CSV file (link), i.e., basically a data set of 5*5 as follows:

 

Then, we split the data frame on the ‘age‘ column using the filter function, once where its value equals 18 and once where it does not. Finally, we display both data frames.

Python3




# PySpark - Split dataframe by
# column value using filter function

# Import SparkSession
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
df = spark_session.read.csv('student_data.csv',
                            sep=',',
                            inferSchema=True,
                            header=True)

# Split the data frame on age: rows where it is 18, then rows where it is not
df.filter(df.age == 18).show(truncate=False)
df.filter(df.age != 18).show(truncate=False)


Output:

 

Method 2: Using the where function

The where function filters rows from a data frame based on a given SQL expression or condition; it is an alias of filter. Here we split the data frame by column value using where: we apply the condition once with equal to, and once with not equal to, and display both resulting data frames.

Syntax: data_frame.where(condition)

In this example, we have read a CSV file (link), i.e., basically a data set of 5*5 as follows:

 

Then, we split the data frame on the ‘age‘ column using the where function, once where its value equals 18 and once where it does not. Finally, we display both data frames.

Python3




# PySpark - Split dataframe by column value using where function

# Import SparkSession
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
df = spark_session.read.csv('student_data.csv',
                            sep=',',
                            inferSchema=True,
                            header=True)

# Split the data frame on age: rows where it is 18, then rows where it is not
df.where(df.age == 18).show(truncate=False)
df.where(df.age != 18).show(truncate=False)


Output:

 


