Add new column with default value in PySpark dataframe

  • Last Updated : 29 Jun, 2021

In this article, we are going to see how to add a new column with a default value to a PySpark DataFrame.

There are three ways to add a column with a default value to a PySpark DataFrame:

  • Using pyspark.sql.DataFrame.withColumn(colName, col)
  • Using pyspark.sql.DataFrame.select(*cols)
  • Using pyspark.sql.SparkSession.sql(sqlQuery)

Method 1: Using pyspark.sql.DataFrame.withColumn(colName, col)

withColumn() adds a column, or replaces an existing column that has the same name, and returns a new DataFrame containing all the existing columns plus the new one. The column expression must be an expression over this DataFrame; trying to add a column from some other DataFrame will raise an error.

Syntax: pyspark.sql.DataFrame.withColumn(colName, col)

Parameters: This method accepts the following parameters, described below.



  • colName: It is a string and contains the name of the new column.
  • col: It is a Column expression for the new column.

Returns: DataFrame
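To make the cross-DataFrame restriction concrete, here is a minimal sketch (assuming an active SparkSession named spark, as in the examples below; df_a and df_b are illustrative names):

Python3

# Two independent DataFrames: a column of df_b is not an
# expression over df_a, so using it with df_a.withColumn()
# is expected to raise an AnalysisException.
df_a = spark.createDataFrame([(1,)], ["a"])
df_b = spark.createDataFrame([(2,)], ["b"])

try:
    df_a.withColumn("bad", df_b["b"])
except Exception as err:
    print(type(err).__name__)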

First, create a simple DataFrame.

Python3

import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# Creating the session
spark = SparkSession.builder.getOrCreate()

# Creating the DataFrame from a pandas DataFrame
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)
print("Original DataFrame:")
df.show()

Output:
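The pandas detour is only a convenience; as a minimal sketch, the same DataFrame could also be built directly from Python rows (df2 is an illustrative name):

Python3

# Equivalent construction without pandas: pass rows and
# column names straight to createDataFrame.
rows = [("Anurag", "Patna", 20123, 140000),
        ("Manjeet", "Delhi", 20124, 300000)]
df2 = spark.createDataFrame(rows, ["Name", "Address", "ID", "Sell"])
df2.show()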

Add a new column with a default value:

Python3




# Add a new column with NULL values
from pyspark.sql.functions import lit
df = df.withColumn("Rewards", lit(None))
df.show()

# Add a new constant column
df = df.withColumn("Bonus Percent", lit(0.25))
df.show()

Output:
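One caveat: lit(None) creates a column of NullType, which some sinks (Parquet, for example) cannot store. A minimal sketch of giving the null column a concrete type with cast():

Python3

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# Casting the null literal gives the new column a concrete
# type, which shows up in the schema.
df = df.withColumn("Rewards", lit(None).cast(StringType()))
df.printSchema()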



Method 2: Using pyspark.sql.DataFrame.select(*cols)

We can use pyspark.sql.DataFrame.select() to create a new column in a DataFrame and set it to a default value. It projects a set of expressions and returns a new DataFrame.

Syntax: pyspark.sql.DataFrame.select(*cols)

Parameters: This method accepts the following parameter, described below.

  • cols: It contains column names (string) or expressions (Column)

Returns: DataFrame

First, create a simple DataFrame.

Python3

import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# Creating the session
spark = SparkSession.builder.getOrCreate()

# Creating the DataFrame from a pandas DataFrame
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)
print("Original DataFrame:")
df.show()

Output:

Add a new column with a default value:



Python3

# Add a new column with NULL values
from pyspark.sql.functions import lit
df = df.select('*', lit(None).alias("Rewards"))

# Add a new constant column
df = df.select('*', lit(0.25).alias("Bonus Percent"))
df.show()

Output:
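A closely related alternative is selectExpr(), which takes SQL expression strings instead of Column objects. Starting again from the original DataFrame, the constant column could equivalently be added with this minimal sketch:

Python3

# selectExpr accepts SQL expression strings; the backticks quote
# the column name because it contains a space.
df = df.selectExpr('*', "0.25 as `Bonus Percent`")
df.show()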

Method 3: Using pyspark.sql.SparkSession.sql(sqlQuery)

We can use pyspark.sql.SparkSession.sql() to create a new column in a DataFrame and set it to a default value. It returns a DataFrame representing the result of the given query.

Syntax: pyspark.sql.SparkSession.sql(sqlQuery)

Parameters: This method accepts the following parameter, described below.

  • sqlQuery: It is a string and contains the SQL query to be executed.

Returns: DataFrame

First, create a simple DataFrame:

Python3

import findspark
findspark.init()

# Importing the modules
import pandas as pd
from pyspark.sql import SparkSession

# Creating the session
spark = SparkSession.builder.getOrCreate()

# Creating the DataFrame from a pandas DataFrame
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)
print("Original DataFrame:")
df.show()

Output:



Add a new column with a default value:

Python3

# Add columns to the DataFrame using SQL
df.createOrReplaceTempView("GFG_Table")

# Add a new column with NULL values
df = spark.sql("select *, null as Rewards from GFG_Table")

# Add a new constant column (the name uses an underscore because
# a SQL identifier containing a space would need backticks)
df.createOrReplaceTempView("GFG_Table")
df = spark.sql("select *, 0.25 as Bonus_Percent from GFG_Table")
df.show()

Output:
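A bare null in SQL yields an untyped column, just like lit(None) in Method 1; when a concrete type is needed, the first query above can be written with a CAST. A minimal sketch:

Python3

# cast(null as string) produces a typed NULL column instead
# of an untyped one.
df = spark.sql("select *, cast(null as string) as Rewards from GFG_Table")
df.printSchema()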
