How to select a range of rows from a dataframe in PySpark?
Last Updated: 18 Jul, 2022
In this article, we are going to select a range of rows from a PySpark dataframe.
It can be done in these ways:
- Using filter().
- Using where().
- Using SQL expression.
Creating Dataframe for demonstration:
Python3
from pyspark.sql import SparkSession

# create the SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# sample data: student ID, name, college, and two subject scores
data = [["1", "sravan", "vignan", 67, 89],
        ["2", "ojaswi", "vvit", 78, 89],
        ["3", "rohith", "vvit", 100, 80],
        ["4", "sridevi", "vignan", 78, 80],
        ["1", "sravan", "vignan", 89, 98],
        ["5", "gnanesh", "iit", 94, 98]]

columns = ['student ID', 'student NAME',
           'college', 'subject1', 'subject2']

# create the dataframe and display it
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:
Method 1: Using filter()
This function is used to filter the dataframe by selecting the records based on the given condition.
Syntax: dataframe.filter(condition)
Example: Python code to select rows where the subject1 column lies between 23 and 78.
Python3
# rows whose subject1 score lies between 23 and 78 (inclusive)
dataframe.filter(
    dataframe.subject1.between(23, 78)).show()
Output:
Method 2: Using where()
This function is used to filter the dataframe by selecting the records based on the given condition.
Syntax: dataframe.where(condition)
Example 1: Python program to select rows where the subject1 column lies between 85 and 100.
Python3
# rows whose subject1 score lies between 85 and 100 (inclusive)
dataframe.where(
    dataframe.subject1.between(85, 100)).show()
Output:
Example 2: Select rows whose college column equals "vvit".
Python3
# between() with identical bounds keeps only rows where college == "vvit"
dataframe.where(
    dataframe.college.between("vvit", "vvit")).collect()
Output:
[Row(student ID='2', student NAME='ojaswi', college='vvit', subject1=78, subject2=89),
Row(student ID='3', student NAME='rohith', college='vvit', subject1=100, subject2=80)]
Method 3: Using SQL Expression
By using SQL query with between() operator we can get the range of rows.
Syntax: spark.sql("SELECT * FROM my_view WHERE column_name BETWEEN value1 AND value2")
Example 1: Python program to select rows where the subject1 column lies between 23 and 78.
Python3
# register the dataframe as a temporary view so it can be queried with SQL
dataframe.createOrReplaceTempView("my_view")

spark.sql("SELECT * FROM my_view "
          "WHERE subject1 BETWEEN 23 AND 78").collect()
Output:
[Row(student ID='1', student NAME='sravan', college='vignan', subject1=67, subject2=89),
Row(student ID='2', student NAME='ojaswi', college='vvit', subject1=78, subject2=89),
Row(student ID='4', student NAME='sridevi', college='vignan', subject1=78, subject2=80)]
Example 2: Select rows whose student ID lies between 1 and 3.
Python3
dataframe.createOrReplaceTempView("my_view")

# the column name contains a space, so it must be quoted with backticks
spark.sql("SELECT * FROM my_view "
          "WHERE `student ID` BETWEEN 1 AND 3").collect()
Output:
[Row(student ID='1', student NAME='sravan', college='vignan', subject1=67, subject2=89),
Row(student ID='2', student NAME='ojaswi', college='vvit', subject1=78, subject2=89),
Row(student ID='3', student NAME='rohith', college='vvit', subject1=100, subject2=80),
Row(student ID='1', student NAME='sravan', college='vignan', subject1=89, subject2=98)]