Extract First and Last N Rows from PySpark DataFrame

Last Updated : 06 Jun, 2021

In this article, we are going to extract the first N rows and the last N rows from a dataframe using PySpark in Python. To do this, we will first create a sample dataframe.

We first have to create a SparkSession object, giving it an app name with appName() and obtaining the session with the getOrCreate() method.

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

Finally, after creating the list of data and the list of column names, we pass both to the createDataFrame() method:

dataframe = spark.createDataFrame(data, columns)

Python3

# importing module 
import pyspark 
  
# importing sparksession from pyspark.sql module 
from pyspark.sql import SparkSession 
  
# creating sparksession and giving an app name 
spark = SparkSession.builder.appName('sparkdf').getOrCreate() 
  
# list  of employee data with 5 row values 
data = [["1", "sravan", "company 1"], 
        ["2", "ojaswi", "company 2"], 
        ["3", "bobby", "company 3"], 
        ["4", "rohith", "company 2"], 
        ["5", "gnanesh", "company 1"]] 
  
# specify column names 
columns = ['Employee ID', 'Employee NAME', 'Company Name'] 
  
# creating a dataframe from the lists of data 
dataframe = spark.createDataFrame(data, columns) 
  
print('Actual data in dataframe') 
dataframe.show() 

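Output:

Actual data in dataframe
+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+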

Extracting first N rows

We can extract the first N rows using several methods, which are discussed below with examples:

Method 1: Using head()

This function is used to extract the top N rows of the given dataframe. It returns them as a list of Row objects.

Syntax: dataframe.head(n)

where,

n specifies the number of rows to be extracted from the top

dataframe is the dataframe created from the nested lists using pyspark.

Python3

print("Top 2 rows ") 
  
# extract top 2 rows 
a = dataframe.head(2) 
print(a) 
  
print("Top 1 row ") 
  
# extract top 1 row 
a = dataframe.head(1) 
print(a) 

Output:

Top 2 rows

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]

Top 1 row

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]

Method 2: Using first()

This function is used to extract only the first row of the dataframe. It returns a single Row object rather than a list.

Syntax: dataframe.first()

It doesn't take any parameters.

dataframe is the dataframe created from the nested lists using pyspark.

Python3

print("Top row ") 
  
# extract top  row 
a = dataframe.first() 
print(a) 

Output:

Top row

Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')

Method 3: Using show()

This function displays the rows of the dataframe, starting from the top. It only prints them and does not return the rows.

Syntax: dataframe.show(n)

where,

dataframe is the input dataframe

n is the number of rows to be displayed from the top; if n is not specified, show() prints the first 20 rows of the dataframe

Python3

# show() function to get  
# 2 rows 
dataframe.show(2) 

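Output:

+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
+-----------+-------------+------------+
only showing top 2 rows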

Extracting Last N rows

Extracting the last rows means getting the last N rows from the given dataframe. For this, we use the tail() function, which is available from Spark 3.0 onwards and returns the last N rows as a list of Row objects.

Syntax: dataframe.tail(n)

where,

n is the number of rows to get from the end

dataframe is the input dataframe

Example:

Python3

print("Last 2 rows ") 
  
# extract last 2 rows 
a = dataframe.tail(2) 
print(a) 
  
print("Last 1 row ") 
  
# extract last 1 row 
a = dataframe.tail(1) 
print(a) 

Output:

Last 2 rows

[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

Last 1 row

[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]