PySpark dataframe add column based on other columns
In this article, we will see how to add a column to a PySpark DataFrame based on the values of other columns.
Creating Dataframe for demonstration:
Here we create a DataFrame from a list of rows and a list of column names.
Method 1: Using withColumn()
withColumn() can change the values of an existing column, convert its data type, or create a new column.
Syntax: df.withColumn(colName, col)
Returns: A new DataFrame with the column added, or with the existing column of the same name replaced.
Method 2: Using SQL query
Here we run a SQL query from within PySpark. First, create a temporary view of the DataFrame with createTempView(); the view lives as long as the SparkSession that created it. createTempView() raises an error if a view with that name already exists, whereas createOrReplaceTempView() creates the view or replaces an existing one. registerTempTable() is the older, deprecated name for the same create-or-replace behavior.
After creating the view, query it with the SparkSession's sql() method, which takes the entire SQL statement as a string and returns the result as a new DataFrame.
Method 3: Using UDF
In this method, we define a user-defined function (UDF) that takes two parameters and returns the total price. A UDF lets us apply custom Python logic to columns when no built-in function fits our requirements.
Next, we declare the return data type of the UDF and create the function, which returns the sum of the values passed in from each row.