Open In App

How to Change Column Type in PySpark Dataframe ?

Last Updated : 18 Jul, 2021
Improve
Improve
Like Article
Like
Save
Share
Report

In this article, we are going to see how to change the column type of pyspark dataframe.

Creating dataframe for demonstration:

Python




# Create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkExamples').getOrCreate()
  
# Create a spark dataframe
columns = ["Name", "Course_Name",
           "Duration_Months",
           "Course_Fees", "Start_Date",
           "Payment_Done"]
data = [
    ("Amit Pathak", "Python", 3,
     10000, "02-07-2021", True),
    ("Shikhar Mishra", "Soft skills",
     2, 8000, "07-10-2021", False),
    ("Shivani Suvarna", "Accounting",
     6, 15000, "20-08-2021", True),
    ("Pooja Jain", "Data Science", 12,
     60000, "02-12-2021", False),
]
course_df = spark.createDataFrame(data).toDF(*columns)
  
# View the dataframe
course_df.show()


Output:

Let’s see the schema of dataframe:

Python




# View the column datatypes
course_df.printSchema()


Output:

Method 1: Using DataFrame.withColumn()

The DataFrame.withColumn(colName, col) returns a new DataFrame by adding a column or replacing the existing column that has the same name.

We will make use of cast(x, dataType) method to casts the column to a different data type. Here, the parameter “x” is the column name and dataType is the datatype in which you want to change the respective column to.

Example 1: Change datatype of single columns.

Python




# Cast Course_Fees from integer type to float type
course_df2 = course_df.withColumn("Course_Fees"
                                  course_df["Course_Fees"]
                                  .cast('float'))
course_df2.printSchema()


Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: boolean (nullable = true)

In the above example, we can observe that the “Course_Fees” column datatype is changed to float from long. 

Example 2: Change datatype of multiple columns.

Python




# We can also make use of datatypes from 
# pyspark.sql.types
from pyspark.sql.types import StringType, DateType, FloatType
  
course_df3 = course_df \
  .withColumn("Course_Fees" ,
              course_df["Course_Fees"]
              .cast(FloatType()))   \
  .withColumn("Payment_Done",
              course_df["Payment_Done"]
              .cast(StringType()))    \
  .withColumn("Start_Date"  ,
              course_df["Start_Date"]
              .cast(DateType())) \
  
course_df3.printSchema()


Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: date (nullable = true)
 |-- Payment_Done: string (nullable = true)

In the above example, we changed the datatype of columns “Course_Fees”, “Payment_Done”, and “Start_Date” to “float”, “str” and “datetype” respectively.

Method 2: Using DataFrame.select()

Here we will use select() function, this function is used to select the columns from the dataframe

Syntax: dataframe.select(columns)

Where dataframe is the input dataframe and columns are the input columns

Example 1: Change a single column.

Let us convert the `course_df3` from the above schema structure, back to the original schema.

Python




from pyspark.sql.types import StringType, BooleanType, IntegerType
  
course_df4 = course_df3.select(
    course_df3.Name,
    course_df3.Course_Name,
    course_df3.Duration_Months,
    (course_df3.Course_Fees.cast(IntegerType()))
    .alias('Course_Fees'),
    (course_df3.Start_Date.cast(StringType()))
    .alias('Start_Date'),
    (course_df3.Payment_Done.cast(BooleanType()))
    .alias('Payment_Done'),
)
  
course_df4.printSchema()


Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: integer (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: boolean (nullable = true)

Example 2: Changing multiple columns to the same datatype.

Python




# Changing datatype of all the columns
# to string type
from pyspark.sql.types import StringType
  
course_df5 = course_df.select(
  [course_df.cast(StringType())
   .alias(c) for c in course_df.columns]
)
course_df5.printSchema()


Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: string (nullable = true)
 |-- Course_Fees: string (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: string (nullable = true)

Example 3: Changing multiple columns to the different datatypes.

Let us use the `course_df5` which has all the column type as `string`. We will change the column types to a respective format.

Python




from pyspark.sql.types import (
    StringType, BooleanType, IntegerType, FloatType, DateType
)
  
coltype_map = {
    "Name": StringType(),
    "Course_Name": StringType(),
    "Duration_Months": IntegerType(),
    "Course_Fees": FloatType(),
    "Start_Date": DateType(),
    "Payment_Done": BooleanType(),
}
  
# course_df6 has all the column
# types as string
course_df6 = course_df5.select(
    [course_df5.cast(coltype_map)
     .alias(c) for c in course_df5.columns]
)
course_df6.printSchema()


Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: integer (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: date (nullable = true)
 |-- Payment_Done: boolean (nullable = true)

Method 3: Using spark.sql()

Here we will use SQL query to change the column type.

Syntax: spark.sql(“sql Query”)

Example: Using spark.sql()

Python




# course_df5 has all the column datatypes as string
course_df5.createOrReplaceTempView("course_view")
  
course_df7 = spark.sql('''
SELECT 
  Name,
  Course_Name,
  INT(Duration_Months),
  FLOAT(Course_Fees),
  DATE(Start_Date),
  BOOLEAN(Payment_Done)
FROM course_view
''')
  
course_df7.printSchema()


Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: integer (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: date (nullable = true)
 |-- Payment_Done: boolean (nullable = true)


Like Article
Suggest improvement
Previous
Next
Share your thoughts in the comments

Similar Reads