Open In App

Delete rows in PySpark dataframe based on multiple conditions

In this article, we are going to see how to delete rows in PySpark dataframe based on multiple conditions.

Method 1: Using Logical expression

Here we are going to use the logical expression to filter the row. Filter() function is used to filter the rows from RDD/DataFrame based on the given condition or SQL expression.



Syntax: filter( condition)

Parameters: 



  • Condition: Logical condition or SQL expression

Example 1: 




# importing module
import pyspark
  
# importing sparksession from pyspark.sql
# module
from pyspark.sql import SparkSession
  
# spark library import
import pyspark.sql.functions
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of students  data
data = [["1", "Amit", " DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]
  
# specify column names
columns = ['student_ID', 'student_NAME', 'college']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
dataframe = dataframe.filter(dataframe.college != "IIT")
  
dataframe.show()

Output:

Example 2:




# importing module
import pyspark
  
# importing sparksession from pyspark.sql
# module
from pyspark.sql import SparkSession
  
# spark library import
import pyspark.sql.functions
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of students  data
data = [["1", "Amit", " DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]
  
# specify column names
columns = ['student_ID', 'student_NAME', 'college']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
dataframe = dataframe.filter(
    ((dataframe.college != "DU")
     & (dataframe.student_ID != "3"))
)
  
dataframe.show()

Output:

Method 2: Using when() method

It evaluates a list of conditions and returns a single value. Thus passing the condition and its required values will get the job done.

Syntax: When( Condition, Value)

Parameters:

  • Condition: Boolean or columns expression.
  • Value: Literal Value

Example: 




# importing module
import pyspark
  
# importing sparksession from pyspark.sql 
# module
from pyspark.sql import SparkSession
  
# spark library import
import pyspark.sql.functions
  
# spark library import
from pyspark.sql.functions import when
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of students  data
data = [["1", "Amit", " DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]
  
# specify column names
columns = ['student_ID', 'student_NAME', 'college']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
dataframe.withColumn('New_col',
                     when(dataframe.student_ID != '5', "True")
                     .when(dataframe.student_NAME != 'gnanesh', "True")
                     ).filter("New_col == True").drop("New_col").show()

Output:


Article Tags :