Get specific row from PySpark dataframe

Last Updated : 18 Jul, 2021

In this article, we will discuss how to get a specific row from a PySpark dataframe.

Creating Dataframe for demonstration:

Python3

# importing module 
import pyspark 
  
# importing sparksession 
# from pyspark.sql module 
from pyspark.sql import SparkSession 
  
# creating sparksession 
# and giving an app name 
spark = SparkSession.builder.appName('sparkdf').getOrCreate() 
  
# list  of employee data with 5 row values 
data = [["1", "sravan", "company 1"], 
        ["2", "ojaswi", "company 2"], 
        ["3", "bobby", "company 3"], 
        ["4", "rohith", "company 2"], 
        ["5", "gnanesh", "company 1"]] 
  
# specify column names 
columns = ['Employee ID', 'Employee NAME', 
           'Company Name'] 
  
# creating a dataframe from the lists of data 
dataframe = spark.createDataFrame(data, columns) 
  
# display dataframe 
dataframe.show() 

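Output:

+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+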

Method 1: Using collect()

collect() returns all rows of the dataframe as a list of Row objects on the driver, so a single row can be picked out by list index.

Syntax: dataframe.collect()[index_position]

Where,

dataframe is the pyspark dataframe

index_position is the zero-based index of the row; negative values count from the end

Example: Python code to access rows

Python3

# get first row 
print(dataframe.collect()[0]) 
  
# get second row 
print(dataframe.collect()[1]) 
  
# get last row 
print(dataframe.collect()[-1]) 
  
# get third row 
print(dataframe.collect()[2]) 

Output:

Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')

Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')

Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')

Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')

Method 2: Using show()

This function prints the top n rows of the pyspark dataframe as a table. It returns None rather than the rows themselves.

Syntax: dataframe.show(no_of_rows)

where, no_of_rows is the number of rows to display (20 by default)

Example: Python code to get the data using show() function

Python3

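# display only the top 2 rows
# (show() prints the table itself and returns None,
# so wrapping it in print() would also print "None")
dataframe.show(2)

# display only the top 1 row
dataframe.show(1)

# display the whole dataframe
dataframe.show()

Output:

+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
+-----------+-------------+------------+
only showing top 2 rows

+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
+-----------+-------------+------------+
only showing top 1 row

+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+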

Method 3: Using first()

This function is used to return only the first row in the dataframe.

Syntax: dataframe.first()

Example: Python code to select the first row in the dataframe.

Python3

# display first row of the dataframe 
print(dataframe.first()) 

Output:

Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')

Method 4: Using head()

This method returns the top n rows of the dataframe as a list of Row objects.

Syntax: dataframe.head(n)

where, n is the number of rows to be returned

Example: Python code to get the top n rows.

Python3

# display only 1 row 
print(dataframe.head(1)) 
  
# display only top 3  rows 
print(dataframe.head(3)) 
  
# display only top 2 rows 
print(dataframe.head(2)) 

Output:

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')]

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]

Method 5: Using tail()

This method returns the last n rows of the dataframe as a list of Row objects.

Syntax: dataframe.tail(n)

where n is the number of rows to be returned from the end of the dataframe.

Example: Python code to get last n rows

Python3

# get the last row
print(dataframe.tail(1))

# get the last 3 rows
print(dataframe.tail(3))

# get the last 2 rows
print(dataframe.tail(2))

Output:

[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

[Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

Method 6: Using select() with collect() method

select() picks the columns to keep, and collect() then returns the rows as a list, so a particular row can be taken by index.

Syntax: dataframe.select([columns]).collect()[index]

where,

dataframe is the pyspark dataframe

columns is the list of columns to be included in each row

index is the zero-based index of the row to be returned.

Example: Python code to select the particular row.

Python3

# select first row 
print(dataframe.select(['Employee ID', 
                        'Employee NAME', 
                        'Company Name']).collect()[0]) 
  
# select third row 
print(dataframe.select(['Employee ID', 
                        'Employee NAME', 
                        'Company Name']).collect()[2]) 
  
# select fourth row 
print(dataframe.select(['Employee ID', 
                        'Employee NAME', 
                        'Company Name']).collect()[3]) 

Output:

Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')

Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')

Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')

Method 7: Using take() method

This method also returns the top n rows as a list of Row objects.

Syntax: dataframe.take(n)

where n is the number of rows to be selected

Python3

# select top 2 rows 
print(dataframe.take(2)) 
  
# select top 4 rows 
print(dataframe.take(4)) 
  
# select top 1 row 
print(dataframe.take(1))

Output:

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')]

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
