
Python PySpark – Drop columns based on column names or String condition

Last Updated : 27 Mar, 2023

In this article, we will be looking at the step-wise approach to dropping columns based on column names or String conditions in PySpark.

Stepwise Implementation

Step 1: Create CSV

In this step, we simply create a CSV file with three rows and three columns.

CSV Used:
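The CSV preview image from the page is not reproduced here. As a stand-in, a small script like the one below can generate a comparable book1.csv (the Name and Age columns and all row values are assumptions for illustration; only the Gender column, dropped in Step 5, is confirmed by the article):

```python
import csv

# Hypothetical contents for book1.csv. The Gender column is the one
# dropped in Step 5; Name, Age, and the row values are assumed.
rows = [
    ["Name", "Age", "Gender"],
    ["Alice", 23, "F"],
    ["Bob", 31, "M"],
    ["Carol", 27, "F"],
]

with open("book1.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```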

 

Step 2: Import PySpark Library

In this step, we import the PySpark package so that we can use its functionality:

import pyspark

Step 3: Start a SparkSession

In this step, we start our Spark session using the SparkSession.builder.appName() function.

Python3




from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName(
    'GeeksForGeeks').getOrCreate()  # You can use any appName
print(spark)


Output:

 

Step 4: Read our CSV

To read our CSV, we use spark.read.csv() with two arguments:

  • header = True [Uses the first row of the CSV as the column names]
  • inferSchema = True [Infers the correct datatype for each column]
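The effect of inferSchema can be illustrated without PySpark at all: Python's built-in csv module always yields strings, which is what spark.read.csv() would give you with inferSchema=False (the sample data below is hypothetical):

```python
import csv
import io

# An in-memory CSV standing in for book1.csv (hypothetical values)
data = "Name,Age,Gender\nAlice,23,F\nBob,31,M\n"

rows = list(csv.reader(io.StringIO(data)))
header, records = rows[0], rows[1:]

# Plain CSV parsing keeps every field as a string...
print(type(records[0][1]).__name__)  # Age comes back as 'str'

# ...whereas inferSchema=True makes Spark detect Age as an integer,
# roughly the conversion done by hand here:
ages = [int(r[1]) for r in records]
print(ages)  # [23, 31]
```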

Python3




df = spark.read.csv('book1.csv', header=True, inferSchema=True)
df.show()


Output:

 

Step 5: Drop Column based on Column Name

Finally, we can see how simple it is to drop a column based on its name.

To drop a column, we use DataFrame.drop(). After applying it, we can see that the Gender column is no longer part of the DataFrame.

Python3




df = df.drop("Gender")
df.show()
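The title also promises dropping columns based on a String condition. DataFrame.drop() accepts several column names at once, so the usual pattern is to filter df.columns by the condition and unpack the result into drop(). The name-filtering part needs no Spark and is sketched below with a plain list (the column names are assumed to match the example CSV):

```python
# Column names as df.columns would return them (assumed for illustration)
columns = ["Name", "Age", "Gender"]

# String condition on the name itself: drop every column whose name
# contains the substring "Gen"
to_drop = [c for c in columns if "Gen" in c]
print(to_drop)  # ['Gender']

# On a real DataFrame, the list is unpacked into DataFrame.drop():
#   df = df.drop(*to_drop)
```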


 


