Drop One or Multiple Columns From PySpark DataFrame

Last Updated: 17 Jun, 2021

In this article, we will discuss how to drop columns from a PySpark dataframe.

In PySpark, the dataframe drop() function removes columns from a dataframe. The related na.drop() function removes rows that contain NULL values; it does not remove columns.

Syntax: dataframe_name.na.drop(how="any", thresh=threshold_value, subset=["column_name_1", "column_name_2"])

  • how – Takes one of two values, 'any' or 'all'. With 'any', a row is dropped if it contains a NULL in any column; with 'all', a row is dropped only if every column is NULL. The default is 'any'.
  • thresh – Takes an integer and drops rows that have fewer than that many non-null values. When set, it overrides how. The default is None.
  • subset – Restricts the NULL check to the listed columns. The default is None, which checks every column.

Python code to create an employee dataframe with three columns:

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data with 7 rows
data =[["1", "sravan", "company 1"],
       ["3", "bobby", "company 3"],
       ["2", "ojaswi", "company 2"],
       ["1", "sravan", "company 1"],
       ["3", "bobby", "company 3"],
       ["4", "rohith", "company 2"],
       ["5", "gnanesh", "company 1"]]
  
# specify column names
columns = ['Employee ID','Employee NAME','Company Name']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
  
dataframe.show()

Output:

+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          3|        bobby|   company 3|
|          2|       ojaswi|   company 2|
|          1|       sravan|   company 1|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+

Example 1: Delete a single column.

Here we are going to delete a single column from the dataframe.

Syntax: dataframe.drop('column name')

Code:

Python3




# delete single column
dataframe = dataframe.drop('Employee ID')
dataframe.show()

Output:

+-------------+------------+
|Employee NAME|Company Name|
+-------------+------------+
|       sravan|   company 1|
|        bobby|   company 3|
|       ojaswi|   company 2|
|       sravan|   company 1|
|        bobby|   company 3|
|       rohith|   company 2|
|      gnanesh|   company 1|
+-------------+------------+

Example 2: Delete multiple columns.

Here we will delete multiple columns from the dataframe.

Syntax: dataframe.drop(*('column 1', 'column 2', 'column n'))

Code:

Python3




# delete two columns; if Example 1 already removed 'Employee ID',
# drop() simply ignores the missing name
dataframe = dataframe.drop(*('Employee NAME',
                             'Employee ID'))
dataframe.show()

Output:

+------------+
|Company Name|
+------------+
|   company 1|
|   company 3|
|   company 2|
|   company 1|
|   company 3|
|   company 2|
|   company 1|
+------------+

Example 3: Delete all columns

Here we will delete all the columns from the dataframe. To do this, we put the column names in a list and unpack it into drop().

Python3




# use a descriptive name; `list` would shadow the Python built-in
columns_to_drop = ['Employee ID', 'Employee NAME', 'Company Name']
  
# delete all columns
dataframe = dataframe.drop(*columns_to_drop)
dataframe.show()

Output:

++
||
++
||
||
||
||
||
||
||
++
