In this article, we are going to select all columns of a PySpark dataframe except one column or a set of columns. For this, we will use the select() and drop() functions.
But first, let's create a dataframe for demonstration.
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "sravan", "vignan"],
        ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"],
        ["4", "sridevi", "vignan"],
        ["1", "sravan", "vignan"],
        ["5", "gnanesh", "iit"]]

# specify column names
columns = ['student ID', 'student NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output:
Method 1: Using drop() function
drop() returns a new dataframe without the specified columns; the original dataframe is left unchanged.
Syntax: dataframe.drop('column_names')
Where dataframe is the input dataframe and column_names are the names of the columns to be dropped.
Example 1: Python program to select data by dropping one column
# drop student id
dataframe.drop('student ID').show()
Output:
Example 2: Python program to drop more than one column (a set of columns)
# drop student id and college
dataframe.drop('student ID', 'college').show()
Output:
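When the columns to exclude are already held in a Python list, drop() still expects them as separate arguments, so the list can be unpacked with *. The following is a minimal sketch of that idea; drop_columns is a hypothetical helper name introduced here for illustration and is not part of PySpark.

```python
def drop_columns(df, names):
    """Return df without the columns listed in names.

    drop_columns is a hypothetical helper, not a PySpark function.
    df is assumed to be a PySpark dataframe and names a list of
    column-name strings.
    """
    # drop() takes column names as separate arguments,
    # so the list is unpacked with *
    return df.drop(*names)
```

With the dataframe created above, drop_columns(dataframe, ['student ID', 'college']).show() should display the same result as Example 2.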
Method 2: Using select() function
select() returns a new dataframe containing only the specified columns.
Syntax: dataframe.select(columns)
Where dataframe is the input dataframe and columns are the columns to keep.
Example 1: Select one column from the dataframe.
# select student id
dataframe.select('student ID').show()
Output:
Example 2: Python program to select two columns, student ID and student NAME
# select student id and student name
dataframe.select('student ID', 'student NAME').show()
Output:
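select() can also be used to keep every column except a given set, by building the list of columns to keep from dataframe.columns. The sketch below shows one way this might be done; select_except is a hypothetical helper name introduced here for illustration, and it assumes plain column names like the ones in this article (no dots or other characters that need escaping).

```python
def select_except(df, excluded):
    """Return df with every column except those named in excluded.

    select_except is a hypothetical helper, not a PySpark function.
    df is assumed to be a PySpark dataframe and excluded an iterable
    of column-name strings.
    """
    excluded = set(excluded)
    # df.columns lists the column names in their original order
    keep = [name for name in df.columns if name not in excluded]
    # select() accepts the column names as separate arguments
    return df.select(*keep)
```

With the dataframe created above, select_except(dataframe, ['student ID']).show() should display the student NAME and college columns, matching the result of dropping student ID with drop().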