How to Add Multiple Columns in PySpark DataFrames?

Last Updated: 30 Jun, 2021

In this article, we will see different ways of adding multiple columns to PySpark DataFrames.

Let’s create a sample dataframe for demonstration:

Dataset Used: Cricket_data_set_odi


# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create DataFrame (the article reads the Cricket_data_set_odi
# JSON file with pandas; a small inline sample with the same
# columns is used here so the snippet runs on its own)
data = [('Player A', 100, 4000, 10),
        ('Player B', 50, 500, 90),
        ('Player C', 150, 6000, 25)]
df = spark.createDataFrame(data, ['Name', 'Matches', 'Runs', 'Wickets'])

# Display Schema
df.printSchema()

# Show DataFrame
df.show()


Method 1: Using withColumn()

withColumn() is used to add a new column or update an existing column on a DataFrame.

Syntax: df.withColumn(colName, col)

Returns: A new DataFrame with the column added, or with the existing column of the same name replaced.



# Using withColumn() twice to add two new columns in one chained call
df.withColumn(
    'Avg_runs', df.Runs / df.Matches).withColumn(
    'wkt+10', df.Wickets + 10).show()


Method 2: Using select()

You can also add multiple columns using select.




# Using select() to add multiple columns: '*' keeps all existing
# columns, and each aliased expression becomes a new column
df.select('*', (df.Runs / df.Matches).alias('Avg_runs'),
          (df.Wickets + 10).alias('wkt+10')).show()


Method 3: Adding a Constant Column to a DataFrame Using withColumn() and select()

Let’s create a new column with a constant value using the lit() SQL function, as in the code below. The lit() function in PySpark is used to add a new column to a DataFrame by assigning it a constant or literal value.


# Using select() with lit() to add a constant column
from pyspark.sql.functions import lit

df.select('*', lit("Cricket").alias("Sport")).show()

