
How to add a constant column in a PySpark DataFrame?


In this article, we will see how to add a constant column to a PySpark DataFrame.

It can be done in these ways:

  • Using lit()
  • Using an SQL query

Creating a DataFrame for demonstration:

Python3




# Create a spark session
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName('SparkExamples').getOrCreate()
  
# Create a spark dataframe
columns = ["Name", "Course_Name",
           "Months",
           "Course_Fees", "Discount",
           "Start_Date", "Payment_Done"]
data = [
    ("Amit Pathak", "Python", 3,
     10000, 1000, "02-07-2021", True),
    ("Shikhar Mishra", "Soft skills",
     2, 8000, 800, "07-10-2021", False),
    ("Shivani Suvarna", "Accounting", 6,
     15000, 1500, "20-08-2021", True),
    ("Pooja Jain", "Data Science", 12,
     60000, 900, "02-12-2021", False),
]
df = spark.createDataFrame(data).toDF(*columns)
  
# View the dataframe
df.show()


Output:

Method 1: Using lit()

In this method, we use the lit() function, which creates a column holding a literal (constant) value, so every row receives the same value. Passing it to withColumn() adds that column to the DataFrame. The same can be done with select(), for example to add a constant column 'literal_values_1' with the value 1.

Syntax: df.withColumn("NEW_COL", lit(VALUE))

Example 1: Adding a constant value as a new column.

Python3




df.withColumn('Status', lit(0)).show()


Output:

Example 2: Adding a constant value based on another column.

Python3




from pyspark.sql.functions import when, lit, col

# "Yes" when Discount is at least 1000, otherwise "NO"
df.withColumn(
    "Great_Discount",
    when(col("Discount") >= 1000, lit("Yes")).otherwise(lit("NO")),
).show()


Output:

Method 2: Using an SQL query

Here we run an SQL query from PySpark. First, register the DataFrame as a temporary view with createOrReplaceTempView(), which creates the view if it does not exist and replaces it if it does. createTempView() instead raises an error when the name is already taken. The view lives as long as the SparkSession that created it. The older registerTempTable() does the same job but has been deprecated since Spark 2.0.

After creating the view, query it with spark.sql(), which takes the SQL statement as a string. Adding a literal to the select list, such as 1 as newCol, creates the constant column.

Python3




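# Register the DataFrame as a temporary view (replaces any existing view of that name)
df.createOrReplaceTempView('courses')
# Select every column plus the literal 1 under the name newCol
newDF = spark.sql('select *, 1 as newCol from courses')
newDF.show()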


Output:



Last Updated : 23 Aug, 2021