
Add new column with default value in PySpark dataframe

Last Updated : 29 Jun, 2021

In this article, we are going to see how to add a new column with a default value in a PySpark DataFrame.

There are three ways to add a column with a default value to a PySpark DataFrame:

  • Using pyspark.sql.DataFrame.withColumn(colName, col)
  • Using pyspark.sql.DataFrame.select(*cols)
  • Using pyspark.sql.SparkSession.sql(sqlQuery)

Method 1: Using pyspark.sql.DataFrame.withColumn(colName, col)

It adds a column, or replaces an existing column of the same name, and returns a new DataFrame containing all existing columns plus the new one. The column expression must be built from this DataFrame; referencing a column from another DataFrame raises an error.

Syntax: pyspark.sql.DataFrame.withColumn(colName, col)

Parameters: This method accepts the following parameters:

  • colName: a string containing the name of the new column.
  • col: a Column expression for the new column.

Returns: DataFrame
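
Note that the col argument must be a Column object; passing a bare Python value raises an error. Wrapping the value in pyspark.sql.functions.lit() turns it into a literal Column expression. A minimal sketch (the Country column and its value are purely illustrative, and df is the DataFrame built below):

Python3

from pyspark.sql.functions import lit

# A bare Python value is not a Column, so this would raise an error:
# df.withColumn("Country", "India")

# lit() wraps the value into a literal Column expression:
df = df.withColumn("Country", lit("India"))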

First, create a simple DataFrame.

Python3
# findspark locates a local Spark installation and makes PySpark
# importable; it is optional if pyspark is already on the Python path
import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# creating the dataframe
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)
print("Original DataFrame :")
df.show()


Output:
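
As an aside, the pandas detour is optional; spark.createDataFrame() also accepts a plain list of tuples together with the column names. A minimal sketch, reusing the session created above:

Python3

# The same kind of DataFrame can be built without pandas
data = [('Anurag', 'Patna', 20123, 140000),
        ('Manjeet', 'Delhi', 20124, 300000)]
df2 = spark.createDataFrame(data, ['Name', 'Address', 'ID', 'Sell'])
df2.show()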

Add a new column with Default Value:

Python3
# Add new column with null values
from pyspark.sql.functions import lit
df = df.withColumn("Rewards", lit(None))
df.show()

# Add new constant column
df = df.withColumn("Bonus Percent", lit(0.25))
df.show()


Output:
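
One caveat: lit(None) produces a column of NullType, which some downstream operations and file writers cannot handle. If the new column should carry a concrete data type, the null literal can be cast explicitly. A minimal sketch, assuming Rewards is meant to hold strings:

Python3

from pyspark.sql.functions import lit

# Cast the null literal so "Rewards" gets a concrete data type
df = df.withColumn("Rewards", lit(None).cast("string"))
df.printSchema()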

Method 2: Using pyspark.sql.DataFrame.select(*cols)

We can use pyspark.sql.DataFrame.select() to create a new column with a default value. It projects a set of expressions and returns a new DataFrame. Because lit() produces an unnamed literal column, alias() is used to give the new column its name.

Syntax: pyspark.sql.DataFrame.select(*cols)

Parameters: This method accepts the following parameter:

  • cols: column names (string) or expressions (Column).

Returns: DataFrame

First, create a simple DataFrame.

Python3
# findspark locates a local Spark installation and makes PySpark
# importable; it is optional if pyspark is already on the Python path
import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# creating the dataframe
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)
print("Original DataFrame :")
df.show()


Output:

Add a new column with Default Value:

Python3
# Add new column with null values
from pyspark.sql.functions import lit
df = df.select('*', lit(None).alias("Rewards"))

# Add new constant column
df = df.select('*', lit(0.25).alias("Bonus Percent"))
df.show()


Output:
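
If you prefer SQL expression strings over Column objects, DataFrame.selectExpr() performs the same projection more compactly. A minimal sketch, applied to the original DataFrame (backticks quote the column name containing a space):

Python3

# selectExpr() takes SQL expression strings instead of Column objects
df = df.selectExpr('*', 'NULL AS Rewards', '0.25 AS `Bonus Percent`')
df.show()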

Method 3: Using pyspark.sql.SparkSession.sql(sqlQuery)

We can use pyspark.sql.SparkSession.sql() to create a new column with a default value. It returns a DataFrame representing the result of the given query. Because the query resolves table names against the session catalog, the DataFrame must first be registered as a temporary view, as shown in the sketch after the parameter list below.

Syntax: pyspark.sql.SparkSession.sql(sqlQuery)

Parameters: This method accepts the following parameter:

  • sqlQuery: a string containing the SQL query to execute.

Returns: DataFrame
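
A minimal sketch of the required registration step (GFG_Table is the view name used throughout this method):

Python3

# Register the DataFrame under a name that SQL queries can reference
df.createOrReplaceTempView("GFG_Table")
spark.sql("SELECT * FROM GFG_Table").show()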

First, create a simple DataFrame:

Python3
# findspark locates a local Spark installation and makes PySpark
# importable; it is optional if pyspark is already on the Python path
import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# creating the session
spark = SparkSession.builder.getOrCreate()

# creating the dataframe
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)
print("Original DataFrame :")
df.show()


Output:

Add a new column with Default Value:

Python3
# Add columns to DataFrame using SQL
df.createOrReplaceTempView("GFG_Table")

# Add new column with null values
df = spark.sql("select *, null as Rewards from GFG_Table")

# Re-register the view so it includes the Rewards column,
# then add a new constant column (0.25 is left unquoted so the
# column is numeric rather than a string)
df.createOrReplaceTempView("GFG_Table")
df = spark.sql("select *, 0.25 as Bonus_Percent from GFG_Table")
df.show()


Output:
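
Whichever method you use, printSchema() is a quick way to confirm that the new columns have the intended data types, for example that Bonus_Percent is numeric rather than a string:

Python3

# List every column with its data type and nullability
df.printSchema()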


