PySpark – Adding a Column from a list of values using a UDF
Last Updated :
30 Jan, 2023
A data frame that is similar to a relational table in Spark SQL, and can be created using various functions in SparkSession is known as a Pyspark data frame. There occur various circumstances in which you get data in the list format but you need it in the form of a column in the data frame. If a similar situation has occurred with you, then you can do it easily by assigning increasing IDs to the data frame and then adding the values in a column. Read the article further to know more about it in detail.
PySpark – Adding a Column from a list of values using a UDF
Example 1:
In the example, we have created a data frame with three columns ‘Roll_Number‘, ‘Fees‘, and ‘Fine‘ as follows:
Once created, we assigned continuously increasing IDs to the data frame using the monotonically_increasing_id function. Also, we defined a list of values, i.e., student_names which need to be added as a column to a data frame. Then, with the UDF on increasing Id’s, we assigned values of the list as a column to the data frame and finally displayed the data frame after dropping the increasing Id’s column.
Python3
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window
spark_session = SparkSession.builder.getOrCreate()
labels_udf = F.udf( lambda indx: student_names[indx - 1 ], StringType())
df = spark_session.createDataFrame(
[( 1 , 10000 , 400 ), ( 2 , 14000 , 500 ), ( 3 , 12000 , 800 )],
[ 'Roll_Number' , 'Fees' , 'Fine' ])
student_names = [ 'Aman' , 'Ishita' , 'Vinayak' ]
df = df.withColumn( "num_id" , row_number().over(
Window.orderBy(monotonically_increasing_id())))
new_df = df.withColumn( 'Names' , labels_udf( 'num_id' ))
new_df.drop( 'num_id' ).show()
|
Output:
Example 2:
In this example, we have used a data set (link), i.e., basically, a 5*5 data set as follows:
Then, we assigned continuously increasing IDs to the data frame using the monotonically_increasing_id function. Also, we defined a list of values, i.e., fine_data which needs to be added as a column to the data frame. Then, with the UDF on increasing Id’s, we assigned values of the list as a column to the data frame and finally displayed the data frame after dropping the increasing Id’s column.
Python3
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window
spark_session = SparkSession.builder.getOrCreate()
labels_udf = F.udf( lambda indx: fine_data[indx - 1 ], IntegerType())
df = csv_file = spark_session.read.csv(
'/content/student_data.csv' , sep = ',' , inferSchema = True , header = True )
fine_data = [ 200 , 300 , 400 , 0 , 500 ]
df = df.withColumn( "num_id" , row_number().over(
Window.orderBy(monotonically_increasing_id())))
new_df = df.withColumn( 'fine' , labels_udf( 'num_id' ))
new_df.drop( 'num_id' ).show()
|
Output:
Share your thoughts in the comments
Please Login to comment...