In this article, we are going to select columns from a dataframe based on a condition, using the where() function in PySpark.
Let’s create a sample dataframe with employee data.
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# display the dataframe
dataframe.show()
Output:
The where() method
This method filters the rows of a dataframe based on the given condition and returns a new dataframe containing only the rows that satisfy it.
Syntax:
dataframe.where(condition)
- Here dataframe is the input dataframe
- The condition is a boolean expression on a column, e.g. dataframe.column_name == value
The select() method
This method picks which columns to return; chaining it with the where clause restricts the rows as well.
Syntax:
dataframe.select('column_name').where(condition)
- Here dataframe is the input dataframe
- column_name is the column (or list of columns) to return
- The condition is the boolean expression the rows must satisfy
Example 1: Python program to return the ID column where ID is less than 3.
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# select ID where ID is less than 3
dataframe.select('ID').where(dataframe.ID < 3).show()
Output:
Example 2: Python program to select ID and NAME where ID == 4.
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# select ID and NAME where ID == 4
dataframe.select(['ID', 'NAME']).where(dataframe.ID == 4).show()
Output:
Example 3: Python program to select all columns based on a condition.
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"], [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"], [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"], [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# select all columns where NAME == 'sridevi'
dataframe.select(['ID', 'NAME', 'Company']).where(
    dataframe.NAME == 'sridevi').show()
Output: