In this article, we are going to select columns from a dataframe based on a condition, using the where() function in PySpark.
Let’s create a sample dataframe with employee data.
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# display the dataframe
dataframe.show()
Output:
The where() method
This method filters the rows of a dataframe based on the given condition and returns a new dataframe containing only the rows that satisfy it.
Syntax:
dataframe.where(condition)
- Here dataframe is the input dataframe
- The condition is a boolean expression on a column, e.g. dataframe.column_name == value
The select() method
This method picks which columns to return; chaining it with the where clause restricts the rows as well.
Syntax:
dataframe.select('column_name').where(condition)
- Here dataframe is the input dataframe
- column_name is the column (or list of columns) to return
- The condition is the boolean expression the rows must satisfy
Example 1: Python program to return the ID column where ID is less than 3.
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# select ID where ID is less than 3
dataframe.select('ID').where(dataframe.ID < 3).show()
Output:
Example 2: Python program to select ID and NAME where ID == 4.
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# select ID and NAME where ID == 4
dataframe.select(['ID', 'NAME']).where(dataframe.ID == 4).show()
Output:
Example 3: Python program to select all columns based on a condition.
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# select all columns where NAME == 'sridevi'
dataframe.select(['ID', 'NAME', 'Company']).where(
    dataframe.NAME == 'sridevi').show()
Output: