
Python PySpark – Drop columns based on column names or String condition

Last Updated : 27 Mar, 2023

In this article, we will be looking at the step-wise approach to dropping columns based on column names or String conditions in PySpark.

Stepwise Implementation

Step 1: Create CSV

In this step, we simply create a CSV file with three rows and three columns.

CSV Used:
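The CSV preview image from the page is not reproduced here. As a stand-in, a small script like the one below can generate a comparable book1.csv (the Name and Age columns and all row values are assumptions for illustration; only the Gender column, dropped in Step 5, is confirmed by the article):

```python
import csv

# Hypothetical contents for book1.csv. The Gender column is the one
# dropped in Step 5; Name, Age, and the row values are assumed.
rows = [
    ["Name", "Age", "Gender"],
    ["Alice", 23, "F"],
    ["Bob", 31, "M"],
    ["Carol", 27, "F"],
]

with open("book1.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```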

 

Step 2: Import PySpark Library

In this step, we import the PySpark package so that we can use its functionality:

import pyspark

Step 3: Start a SparkSession

In this step, we start our Spark session using the SparkSession.builder.appName() function.

Python3




from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName(
    'GeeksForGeeks').getOrCreate()  # You can use any appName
print(spark)


Output:

 

Step 4: Read our CSV

To read our CSV, we use spark.read.csv() with two arguments:

  • header = True [Uses the first row of the CSV as the column names]
  • inferSchema = True [Infers the correct datatype for each column]
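The effect of inferSchema can be illustrated without PySpark at all: Python's built-in csv module always yields strings, which is what spark.read.csv() would give you with inferSchema=False (the sample data below is hypothetical):

```python
import csv
import io

# An in-memory CSV standing in for book1.csv (hypothetical values)
data = "Name,Age,Gender\nAlice,23,F\nBob,31,M\n"

rows = list(csv.reader(io.StringIO(data)))
header, records = rows[0], rows[1:]

# Plain CSV parsing keeps every field as a string...
print(type(records[0][1]).__name__)  # Age comes back as 'str'

# ...whereas inferSchema=True makes Spark detect Age as an integer,
# roughly the conversion done by hand here:
ages = [int(r[1]) for r in records]
print(ages)  # [23, 31]
```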

Python3




df = spark.read.csv('book1.csv', header=True, inferSchema=True)
df.show()


Output:

 

Step 5: Drop Column based on Column Name

Finally, we can see how simple it is to drop a column based on its name.

To drop a column, we use DataFrame.drop(). After applying it, we can see that the Gender column is no longer part of the DataFrame.

Python3




df = df.drop("Gender")
df.show()
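The title also promises dropping columns based on a String condition. DataFrame.drop() accepts several column names at once, so the usual pattern is to filter df.columns by the condition and unpack the result into drop(). The name-filtering part needs no Spark and is sketched below with a plain list (the column names are assumed to match the example CSV):

```python
# Column names as df.columns would return them (assumed for illustration)
columns = ["Name", "Age", "Gender"]

# String condition on the name itself: drop every column whose name
# contains the substring "Gen"
to_drop = [c for c in columns if "Gen" in c]
print(to_drop)  # ['Gender']

# On a real DataFrame, the list is unpacked into DataFrame.drop():
#   df = df.drop(*to_drop)
```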


 


