In this article, we will discuss how to select only numeric or string column names from a Spark DataFrame.
Methods Used:
- createDataFrame: A SparkSession method used to create a Spark DataFrame.
- isinstance: A built-in Python function that checks whether an object is an instance of the specified type.
- dtypes: A DataFrame property that returns a list of (columnName, type) tuples. The list contains every column in the DataFrame together with its data type as a string.
- schema.fields: The list of StructField objects in the DataFrame schema. Each one carries a column's name, data type, and nullability.
Method #1:
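In this method, the dtypes property is used to get a list of (columnName, type) tuples, and the columns whose type string starts with 'string' or 'bigint' are kept. Note that 'bigint' is the only numeric type checked here; other numeric types such as 'int' or 'double' would not be selected.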
Python3
from datetime import date
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample DataFrame with one bigint, one string, and one date column
df = spark.createDataFrame([
    Row(a=1, b='string1', c=date(2021, 1, 1)),
    Row(a=2, b='string2', c=date(2021, 2, 1)),
    Row(a=4, b='string3', c=date(2021, 3, 1))
])
print("DataFrame structure:", df)

# dtypes is a list of (column name, type string) tuples
dt = df.dtypes
print("dtypes result:", dt)

# Keep the columns whose type string starts with 'string' or 'bigint'
columnList = [item[0] for item in dt
              if item[1].startswith('string') or item[1].startswith('bigint')]
print("Result:", columnList)
Output:
DataFrame structure: DataFrame[a: bigint, b: string, c: date]
dtypes result: [('a', 'bigint'), ('b', 'string'), ('c', 'date')]
Result: ['a', 'b']
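Method #1 only recognises 'bigint' as numeric. As a minimal sketch of how the same dtypes-based idea could cover the other numeric type strings Spark reports, the snippet below works directly on a list of (columnName, type) tuples like the one printed above, so it does not need a running Spark session. The names NUMERIC_TYPES, is_numeric and split_columns are hypothetical helpers introduced for illustration, and the set of type strings is an assumption based on the simple type names Spark SQL uses for its numeric types. Exact matching is used instead of startswith so that a type such as 'interval day to second' is not mistaken for 'int'.

```python
# Simple type strings Spark reports for numeric columns (assumed list);
# decimals appear as e.g. 'decimal(10,2)', so they are matched by prefix.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(type_name):
    """Return True if a dtypes type string denotes a numeric column."""
    return type_name in NUMERIC_TYPES or type_name.startswith("decimal")


def split_columns(dtypes):
    """Split a df.dtypes-style list into (numeric names, string names)."""
    numeric = [name for name, type_name in dtypes if is_numeric(type_name)]
    strings = [name for name, type_name in dtypes if type_name == "string"]
    return numeric, strings


# Same shape as the dtypes result shown above; with a real DataFrame,
# pass df.dtypes instead of this literal list.
dtypes = [("a", "bigint"), ("b", "string"), ("c", "date")]
numeric_cols, string_cols = split_columns(dtypes)
print("Numeric columns:", numeric_cols)
print("String columns:", string_cols)
```

Keeping the numeric and string lists separate makes it easy to pass either one to df.select afterwards.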
Method #2:
In this method, schema.fields is used to get the list of StructField objects for the DataFrame. The data type of each field is then checked with isinstance against the desired types, here StringType and LongType.
Python3
from datetime import date
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Sample DataFrame with one bigint, one string, and one date column
df = spark.createDataFrame([
    Row(a=1, b='string1', c=date(2021, 1, 1)),
    Row(a=2, b='string2', c=date(2021, 2, 1)),
    Row(a=4, b='string3', c=date(2021, 3, 1))
])
print("DataFrame structure:", df)

# schema.fields is a list of StructField objects (name, dataType, nullable)
meta = df.schema.fields
print("Metadata:", meta)

# Keep the fields whose data type is StringType or LongType
columnList = [field.name for field in meta
              if isinstance(field.dataType, (StringType, LongType))]
print("Result:", columnList)
Output:
DataFrame structure: DataFrame[a: bigint, b: string, c: date]
Metadata: [StructField(a,LongType,true), StructField(b,StringType,true), StructField(c,DateType,true)]
Result: ['a', 'b']
Last Updated: 22 Mar, 2023