Get specific row from PySpark dataframe
Last Updated: 18 Jul, 2021
In this article, we will discuss several ways to get a specific row from a PySpark dataframe.
Creating Dataframe for demonstration:
Python3
# SparkSession is the entry point for dataframe operations
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Sample employee records
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# Column names for the dataframe
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# Build the dataframe and display it
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output: a table of the five employee rows under the columns Employee ID, Employee NAME and Company Name.
Method 1: Using collect()
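collect() returns all rows of the dataframe to the driver as a Python list of Row objects, which can then be indexed like any other list. Because every row is transferred to the driver, this is only suitable for small dataframes.
Syntax: dataframe.collect()[index_position]
Where,
- dataframe is the PySpark dataframe
- index_position is the zero-based index of the row to return; negative values count from the end of the list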
Example: Python code to access rows
Python3
# Collect once and reuse the list instead of re-running the job for each row
rows = dataframe.collect()

print(rows[0])   # first row
print(rows[1])   # second row
print(rows[-1])  # last row
print(rows[2])   # third row
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')
Method 2: Using show()
show() prints the first n rows of the dataframe to the console as a formatted table. It displays the rows but does not return them.
Syntax: dataframe.show(no_of_rows)
where, no_of_rows is the number of rows to display (20 if omitted)
Example: Python code to get the data using show() function
Python3
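# show() prints the table itself and returns None, so print() is not needed
dataframe.show(2)  # first 2 rows
dataframe.show(1)  # first row only
dataframe.show()   # up to 20 rows (the default)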
Output: each call prints the requested rows to the console as a formatted table.
Method 3: Using first()
first() returns the first row of the dataframe as a single Row object.
Syntax: dataframe.first()
Example: Python code to select the first row in the dataframe.
Python3
# Print the first row as a Row object
print(dataframe.first())
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Method 4: Using head()
head(n) returns the first n rows of the dataframe as a list of Row objects.
Syntax: dataframe.head(n)
where, n is the number of rows to return
Example: Python code to get the first n rows using head()
Python3
print(dataframe.head(1))  # first row, as a one-element list
print(dataframe.head(3))  # first 3 rows
print(dataframe.head(2))  # first 2 rows
Output:
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
 Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
 Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
 Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
Method 5: Using tail()
tail(n) returns the last n rows of the dataframe as a list of Row objects.
Syntax: dataframe.tail(n)
where n is the number of rows to return, counted from the end of the dataframe.
Example: Python code to get the last n rows
Python3
print(dataframe.tail(1))  # last row, as a one-element list
print(dataframe.tail(3))  # last 3 rows
print(dataframe.tail(2))  # last 2 rows
Output:
[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
[Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
 Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
 Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
 Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
Method 6: Using select() with collect() method
select() restricts the dataframe to the listed columns. Chaining collect() and indexing the result then returns a particular row containing only those columns.
Syntax: dataframe.select([columns]).collect()[index]
where,
- dataframe is the PySpark dataframe
- columns is the list of columns to include in each row
- index is the zero-based position of the row to return
Example: Python code to select a particular row.
Python3
# Restrict to the required columns, then collect once and reuse the list
rows = dataframe.select(['Employee ID',
                         'Employee NAME',
                         'Company Name']).collect()

print(rows[0])  # first row
print(rows[2])  # third row
print(rows[3])  # fourth row
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')
Method 7: Using take() method
take(n) also returns the first n rows of the dataframe as a list of Row objects.
Syntax: dataframe.take(n)
where n is the number of rows to return
Example: Python code to get the first n rows using take()
Python3
print(dataframe.take(2))  # first 2 rows
print(dataframe.take(4))  # first 4 rows
print(dataframe.take(1))  # first row, as a one-element list
Output:
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
 Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
 Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
 Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
 Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]