Python PySpark – DataFrame filter on multiple columns
Last Updated: 14 Sep, 2021
In this article, we filter a PySpark DataFrame on multiple columns using the filter() and where() functions.
Creating a DataFrame for demonstration:
Python3
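# import the pyspark module
import pyspark

# import SparkSession to create the session
from pyspark.sql import SparkSession

# create (or reuse) a Spark session named 'sparkdf'
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# sample rows: ID, name, and company (two rows are deliberate duplicates)
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# column names
columns = ['ID', 'NAME', 'Company']

# create the DataFrame and display it
dataframe = spark.createDataFrame(data, columns)
dataframe.show()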
Output:
Method 1: Using filter() Method
filter() returns a new DataFrame that contains only the rows satisfying the given condition; it does not change the columns. The condition is a boolean column expression, and conditions on several columns can be combined into one expression, which is how we filter on multiple columns.
Syntax:
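dataframe.filter(condition)
Here, condition is a boolean expression on one or more columns, for example dataframe.column < value.
Example 1: Filtering on a single column. The condition can use relational operators (such as < and ==) and logical operators (& and |).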
Python3
dataframe.filter(dataframe.ID < 3).show()
Output:
Example 2: Python program to filter data based on two columns. In this example, we select the rows of the DataFrame where ID is less than 3 or NAME is 'sridevi'.
Python3
dataframe.filter((dataframe.ID < 3) |
                 (dataframe.NAME == 'sridevi')).show()
Output:
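Example 3: Filtering on three columns. Here we select the rows where ID is less than 3, or where NAME is 'sridevi' and Company is 'company 1'.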
Python3
dataframe.filter((dataframe.ID < 3) | (
    (dataframe.NAME == 'sridevi') &
    (dataframe.Company == 'company 1'))).show()
Output:
Method 2: Using where() Method
where() is an alias for filter(): it returns a new DataFrame that contains only the rows satisfying the given condition, and it accepts the same kinds of condition.
Syntax:
dataframe.where(condition)
Example 1: Python program to filter on multiple columns
Python3
dataframe.where((dataframe.ID < 3) | (
    (dataframe.NAME == 'sridevi') &
    (dataframe.Company == 'company 1'))).show()
Output: