Add new column with default value in PySpark dataframe
Last Updated :
29 Jun, 2021
In this article, we are going to see how to add a new column with a default value in PySpark Dataframe.
There are three ways to add a column with a default value to a PySpark DataFrame:
- Using pyspark.sql.DataFrame.withColumn(colName, col)
- Using pyspark.sql.DataFrame.select(*cols)
- Using pyspark.sql.SparkSession.sql(sqlQuery)
Method 1: Using pyspark.sql.DataFrame.withColumn(colName, col)
withColumn() adds a column, or replaces an existing column of the same name, and returns a new DataFrame containing all existing columns plus the new one. The column expression must be an expression over this DataFrame; referencing a column from some other DataFrame raises an error.
Syntax: pyspark.sql.DataFrame.withColumn(colName, col)
Parameters:
- colName: It is a string and contains the name of the new column.
- col: It is a Column expression for the new column.
Returns: DataFrame
First, create a simple DataFrame.
Python3
import findspark
findspark.init()

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})

df = spark.createDataFrame(pandas_df)

print("Original DataFrame :")
df.show()
Output:
Add a new column with Default Value:
Python3
from pyspark.sql.functions import lit

# lit(None) gives a null default; a literal value gives a constant default
df = df.withColumn("Rewards", lit(None))
df.show()

df = df.withColumn("Bonus Percent", lit(0.25))
df.show()
Output:
Method 2: Using pyspark.sql.DataFrame.select(*cols)
We can use pyspark.sql.DataFrame.select() to create a new column in a DataFrame and set it to a default value. It projects a set of expressions and returns a new DataFrame.
Syntax: pyspark.sql.DataFrame.select(*cols)
Parameters:
- cols: It contains column names (string) or expressions (Column)
Returns: DataFrame
First, create a simple DataFrame.
Python3
import findspark
findspark.init()

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})

df = spark.createDataFrame(pandas_df)

print("Original DataFrame :")
df.show()
Output:
Add a new column with Default Value:
Python3
from pyspark.sql.functions import lit

# select('*') keeps every existing column; alias() names the new one
df = df.select('*', lit(None).alias("Rewards"))
df = df.select('*', lit(0.25).alias("Bonus Percent"))
df.show()
Output:
Method 3: Using pyspark.sql.SparkSession.sql(sqlQuery)
We can use pyspark.sql.SparkSession.sql() to create a new column in a DataFrame and set it to a default value. It returns a DataFrame representing the result of the given query.
Syntax: pyspark.sql.SparkSession.sql(sqlQuery)
Parameters:
- sqlQuery: It is a string and contains the SQL query to execute.
Returns: DataFrame
First, create a simple DataFrame:
Python3
import findspark
findspark.init()

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham',
             'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore',
                'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})

df = spark.createDataFrame(pandas_df)

print("Original DataFrame :")
df.show()
Output:
Add a new column with Default Value:
Python3
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("GFG_Table")
df = spark.sql("select *, null as Rewards from GFG_Table")

# Re-register so the second query sees the Rewards column
df.createOrReplaceTempView("GFG_Table")
df = spark.sql("select *, 0.25 as Bonus_Percent from GFG_Table")
df.show()
Output: