
Filtering a row in PySpark DataFrame based on matching values from a list

In this article, we filter the rows of a PySpark DataFrame by keeping only those whose column value matches one of the values in a list, using the isin() method.

isin(): This Column method checks whether each value in a column is contained in a given set of values. It takes the values to match and returns a boolean expression that is true for the rows whose column value is one of them.



Syntax: isin([element1, element2, ..., elementN])

Create Dataframe for demonstration:






# importing module
import pyspark
  
# importing sparksession
from pyspark.sql import SparkSession
  
# creating sparksession
# and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of student records: ID, name and college
data = [[1, "sravan", "vignan"],
        [2, "ramya", "vvit"],
        [3, "rohith", "klu"],
        [4, "sridevi", "vignan"],
        [5, "gnanesh", "iit"]]
  
# specify column names
columns = ['ID', 'NAME', 'college']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
dataframe.show()

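Output:

+---+-------+-------+
| ID|   NAME|college|
+---+-------+-------+
|  1| sravan| vignan|
|  2|  ramya|   vvit|
|  3| rohith|    klu|
|  4|sridevi| vignan|
|  5|gnanesh|    iit|
+---+-------+-------+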

Method 1: Using filter() method

filter() returns only the rows that satisfy a given condition. filter() and where() behave the same way; where() is an alias for filter().

Syntax: dataframe.filter(condition)

where condition is a boolean Column expression or a SQL expression string.

Here we pass the result of isin() as the condition.

Syntax: dataframe.filter((dataframe.column_name).isin([list_of_elements])).show()

where,

  • column_name is the column to test
  • list_of_elements are the values to match against the column
  • show() displays the resulting DataFrame

Example 1: Get the rows with particular IDs using filter().




# keep the rows whose ID is 1, 2 or 3
dataframe.filter((dataframe.ID).isin([1,2,3])).show()

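Output:

+---+------+-------+
| ID|  NAME|college|
+---+------+-------+
|  1|sravan| vignan|
|  2| ramya|   vvit|
|  3|rohith|    klu|
+---+------+-------+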

Example 2: Get the rows whose ID is neither 1 nor 3, by negating the condition with ~.




# keep the rows whose ID is not 1 or 3
dataframe.filter(~(dataframe.ID).isin([1, 3])).show()

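Output:

+---+-------+-------+
| ID|   NAME|college|
+---+-------+-------+
|  2|  ramya|   vvit|
|  4|sridevi| vignan|
|  5|gnanesh|    iit|
+---+-------+-------+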

Example 3: Get the rows with a particular name.




# keep the rows whose NAME is sravan
dataframe.filter(dataframe.NAME.isin(['sravan'])).show()

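Output:

+---+------+-------+
| ID|  NAME|college|
+---+------+-------+
|  1|sravan| vignan|
+---+------+-------+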

Method 2: Using where() method

where() returns only the rows that satisfy a given condition. It is an alias for filter().

Syntax: dataframe.where(condition)

where condition is a boolean Column expression or a SQL expression string.

Overall syntax with where() and isin():

dataframe.where((dataframe.column_name).isin([elements])).show()

where,

  • column_name is the column to test
  • elements are the values to match against the column
  • show() displays the resulting DataFrame

Example: Get the rows for a particular college using where().




# keep the rows whose college is vignan
dataframe.where(dataframe.college.isin(['vignan'])).show()

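Output:

+---+-------+-------+
| ID|   NAME|college|
+---+-------+-------+
|  1| sravan| vignan|
|  4|sridevi| vignan|
+---+-------+-------+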

