How to select a range of rows from a dataframe in PySpark?

Last Updated: 29 Jun, 2021

In this article, we are going to select a range of rows from a PySpark dataframe.

It can be done in these ways:


  • Using filter().
  • Using where().
  • Using SQL expression.

Creating a dataframe for demonstration:



Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of students' data
data = [["1", "sravan", "vignan", 67, 89],
        ["2", "ojaswi", "vvit", 78, 89],
        ["3", "rohith", "vvit", 100, 80],
        ["4", "sridevi", "vignan", 78, 80],
        ["1", "sravan", "vignan", 89, 98],
        ["5", "gnanesh", "iit", 94, 98]]
  
# specify column names
columns = ['student ID', 'student NAME',
           'college', 'subject1', 'subject2']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# display dataframe
dataframe.show()

Output:
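+----------+------------+-------+--------+--------+
|student ID|student NAME|college|subject1|subject2|
+----------+------------+-------+--------+--------+
|         1|      sravan| vignan|      67|      89|
|         2|      ojaswi|   vvit|      78|      89|
|         3|      rohith|   vvit|     100|      80|
|         4|     sridevi| vignan|      78|      80|
|         1|      sravan| vignan|      89|      98|
|         5|     gnanesh|    iit|      94|      98|
+----------+------------+-------+--------+--------+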

Method 1: Using filter()

This function filters the dataframe, keeping only the records that satisfy the given condition.

Syntax: dataframe.filter(condition)
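Note that between() is inclusive on both ends: rows whose value equals either boundary are also returned.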

Example: Python code to select rows where the subject1 marks lie between 23 and 78.

Python3






# select rows with marks in subject1
# between 23 and 78
dataframe.filter(
  dataframe.subject1.between(23, 78)).show()

Output:
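+----------+------------+-------+--------+--------+
|student ID|student NAME|college|subject1|subject2|
+----------+------------+-------+--------+--------+
|         1|      sravan| vignan|      67|      89|
|         2|      ojaswi|   vvit|      78|      89|
|         4|     sridevi| vignan|      78|      80|
+----------+------------+-------+--------+--------+

filter() also accepts a SQL expression string in place of a Column, so the same range can be selected without referencing the dataframe object. A minimal equivalent sketch:

Python3

# same selection, written as a SQL expression string
dataframe.filter("subject1 between 23 and 78").show()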

Method 2: Using where()

Like filter(), this function keeps only the records that satisfy the given condition.

Syntax: dataframe.where(condition)
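In PySpark, where() is an alias for filter(), so the two methods are interchangeable.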

Example 1: Python program to select rows where the subject1 marks lie between 85 and 100.

Python3




# select rows with marks in subject1
# between 85 and 100
dataframe.where(
  dataframe.subject1.between(85, 100)).show()

Output:
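+----------+------------+-------+--------+--------+
|student ID|student NAME|college|subject1|subject2|
+----------+------------+-------+--------+--------+
|         3|      rohith|   vvit|     100|      80|
|         1|      sravan| vignan|      89|      98|
|         5|     gnanesh|    iit|      94|      98|
+----------+------------+-------+--------+--------+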

Example 2: Python program to select rows by the college column.



Python3




# select rows where the college
# column is vvit
dataframe.where(
  dataframe.college.between("vvit", "vvit")).collect()

Output:

[Row(student ID='2', student NAME='ojaswi', college='vvit', subject1=78, subject2=89),
Row(student ID='3', student NAME='rohith', college='vvit', subject1=100, subject2=80)]
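Since between() with identical lower and upper bounds is just an equality test, a plain comparison is the more natural way to express this. A minimal equivalent sketch on the same dataframe:

Python3

# equivalent: keep only rows whose college column equals vvit
dataframe.where(dataframe.college == "vvit").collect()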

Method 3: Using SQL Expression

By running a SQL query with the BETWEEN operator on a temporary view of the dataframe, we can get the same range of rows.

Syntax: spark.sql("SELECT * FROM my_view WHERE column_name BETWEEN value1 AND value2")

Example 1: Python program to select rows where subject1 lies between 23 and 78.

Python3




# create a temporary view for the dataframe
dataframe.createOrReplaceTempView("my_view")

# select rows with subject1 between 23 and 78
spark.sql(
    "SELECT * FROM my_view WHERE subject1 BETWEEN 23 AND 78").collect()

Output:



[Row(student ID='1', student NAME='sravan', college='vignan', subject1=67, subject2=89),
Row(student ID='2', student NAME='ojaswi', college='vvit', subject1=78, subject2=89),
Row(student ID='4', student NAME='sridevi', college='vignan', subject1=78, subject2=80)]

Example 2: Python program to select rows where the student ID lies between 1 and 3.

Python3




# create a temporary view for the dataframe
dataframe.createOrReplaceTempView("my_view")

# select rows with student ID between 1 and 3
spark.sql(
    "SELECT * FROM my_view WHERE `student ID` BETWEEN 1 AND 3").collect()
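Note: the column name student ID contains a space, so it has to be quoted with backticks inside the SQL statement.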

Output:

[Row(student ID='1', student NAME='sravan', college='vignan', subject1=67, subject2=89),
Row(student ID='2', student NAME='ojaswi', college='vvit', subject1=78, subject2=89),
Row(student ID='3', student NAME='rohith', college='vvit', subject1=100, subject2=80),
Row(student ID='1', student NAME='sravan', college='vignan', subject1=89, subject2=98)]



