Drop a column with same name using column index in PySpark
Last Updated :
23 Jan, 2023
In this article, we are going to learn how to drop columns that share the same name by using their column index in PySpark.
PySpark provides the ‘drop’ function, which deletes one or more columns from a data frame. Sometimes, however, a data frame contains several columns with the same name, and the requirement is to delete the duplicates while keeping the first occurrence. Since the name alone cannot tell these columns apart, we obtain the column index of every repeated column, rename those columns so that they can be identified, and then delete them using the drop function.
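The index lookup itself is plain Python over the list of column names, so it can be tried without a Spark session. The following is a minimal sketch, assuming a hypothetical helper named duplicate_indexes (it is not part of PySpark) and a list of names like the one df.columns would return:

```python
def duplicate_indexes(columns):
    """Return the positions of every column name that already appeared earlier."""
    seen = set()
    duplicates = []
    for idx, name in enumerate(columns):
        if name in seen:
            duplicates.append(idx)  # repeat of an earlier name
        else:
            seen.add(name)          # first occurrence is kept
    return duplicates


# Column names as df.columns would list them for a frame with repeated names.
columns = ['name', 'marks', 'marks', 'marks']
print(duplicate_indexes(columns))  # [2, 3]
```

The indexes are zero-based, so the first ‘marks’ column (index 1) is kept and only the later repeats are reported.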
Example 1:
In this example, we create a data frame with four columns: ‘name’, ‘marks’, ‘marks’, and ‘marks’.
Once it is created, we get the indexes of all the columns whose name repeats an earlier column, i.e., 2 and 3 (indexes start at 0), and add the suffix ‘_duplicate’ to them using a for loop. Finally, we remove the columns whose names contain ‘_duplicate’ and display the data frame.
Python3
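from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark_session = SparkSession.builder.getOrCreate()

# Data frame whose last three columns all share the name 'marks'
df = spark_session.createDataFrame(
    [('Arun', 1, 2, 3), ('Aniket', 4, 5, 6),
     ('Ishita', 7, 8, 9)],
    ['name', 'marks', 'marks', 'marks'])

# Indexes of the columns whose name already appeared earlier
df_cols = df.columns
duplicate_col_index = [idx for idx, val in enumerate(df_cols)
                       if val in df_cols[:idx]]

# Tag each duplicate with the '_duplicate' suffix
for i in duplicate_col_index:
    df_cols[i] = df_cols[i] + '_duplicate'

# Apply the new names, then drop every tagged column
df = df.toDF(*df_cols)
cols_to_remove = [c for c in df_cols if '_duplicate' in c]
df.drop(*cols_to_remove).show()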
Output: the data frame with only the columns ‘name’ and ‘marks’.
Example 2:
In this example, we create a data frame with five columns: ‘day’, ‘temperature’, ‘temperature’, ‘temperature’, and ‘temperature’.
Once it is created, we get the indexes of all the columns whose name repeats an earlier column, i.e., 2, 3, and 4 (indexes start at 0), and add the prefix ‘day_’ to them using a for loop. Finally, we remove the columns whose names start with the prefix ‘day_’ and display the data frame.
Python3
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark_session = SparkSession.builder.getOrCreate()

# Data frame whose last four columns all share the name 'temperature'
df = spark_session.createDataFrame(
    [('Monday', 25, 27, 29, 30), ('Tuesday', 40, 38, 36, 34),
     ('Wednesday', 18, 20, 22, 17), ('Thursday', 25, 27, 29, 19)],
    ['day', 'temperature', 'temperature', 'temperature',
     'temperature'])

# Indexes of the columns whose name already appeared earlier
df_cols = df.columns
duplicate_col_index = [idx for idx, val in enumerate(df_cols)
                       if val in df_cols[:idx]]

# Tag each duplicate with the 'day_' prefix
for i in duplicate_col_index:
    df_cols[i] = 'day_' + df_cols[i]

# Apply the new names, then drop every column carrying the prefix
df = df.toDF(*df_cols)
cols_to_remove = [c for c in df_cols if c.startswith('day_')]
df.drop(*cols_to_remove).show()
Output: the data frame with only the columns ‘day’ and ‘temperature’.
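Both examples mark the repeated columns by renaming them and then drop every name that carries the marker. A marker such as ‘day_’ would also match a genuine column called, say, ‘day_of_week’, so the marker has to be chosen with care. One way around this, sketched here with a hypothetical helper named plan_duplicate_drop (it is not a PySpark function), is to build each new name from the column index and to return exactly those names for dropping:

```python
def plan_duplicate_drop(columns, marker='_duplicate_'):
    """Return (renamed_columns, names_to_drop) for a list of column names.

    Each repeated name gets the marker plus its index appended, so the
    new names are unique and can be dropped by exact match.
    """
    taken = set(columns)  # every name currently in use
    seen = set()          # names whose first occurrence was already visited
    renamed, to_drop = [], []
    for idx, name in enumerate(columns):
        if name in seen:
            new_name = f'{name}{marker}{idx}'
            # Extend the name until it collides with nothing already present.
            while new_name in taken:
                new_name += '_'
            taken.add(new_name)
            renamed.append(new_name)
            to_drop.append(new_name)
        else:
            seen.add(name)
            renamed.append(name)
    return renamed, to_drop


renamed, to_drop = plan_duplicate_drop(
    ['day', 'temperature', 'temperature', 'temperature', 'temperature'])
print(renamed)
print(to_drop)
# With a data frame df holding these columns, the plan would be applied as:
# df.toDF(*renamed).drop(*to_drop).show()
```

Because the names to drop are returned explicitly, no substring or prefix test is needed, and an existing column that happens to contain the marker is left untouched.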