Skip to content
Related Articles

Related Articles

Improve Article

How to Get substring from a column in PySpark Dataframe ?

  • Last Updated : 09 Sep, 2021
Geek Week

In this article, we are going to see how to get the substring from the PySpark Dataframe column and how to create the new column and put the substring in that newly created column.

We can get the substring of the column using substring() and substr() function.

Syntax: substring(str,pos,len)

df.col_name.substr(start, length)

Parameter:



  • str – It can be string or name of the column from which we are getting the substring.
  • start and pos – Through this parameter we can give the starting position from where substring is start.
  • length and len – It is the length of the substring from the starting position.

Let’s create a dataframe.

Python3




# importing necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring
 
 
# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Substring.com") \
        .getOrCreate()
    return spk
 
def create_df(spark, data, schema):
 
    df1 = spark.createDataFrame(data, schema)
    return df1
 
if __name__ == "__main__":
 
    input_data = [("India", +91, 2701, 2020),
                  ("United States of America", +1, 1301, 2020),
                  ("Israel", +972, 3102, 2020),
                  ("Dubai", +971, 2901, 2020),
                  ("Russia", 7, 3101, 2020)]
 
    # calling function to create SparkSession
    spark = create_session()
 
    schema = ["Country", "Country Code",
              "Data", "Year"]
 
    # calling function to create dataframe
    df = create_df(spark, input_data, schema)
    df.show()

Output:

Example 1: Using substring() getting the substring and creating new column using withColumn() function.

Python




if __name__ == "__main__":
   
    # creating Month column and get the
    # substring from the Data column
    # creating Date column and get the
    # substring from the Data column
    df = df.withColumn(
      "Month", substring("Data", 1, 2)).withColumn(
      "Date", substring("Data", 3, 4))
 
    # dropping the Data column from the
    # Dataframe
    df = df.drop("Data")
 
    # printing Dataframe schema to get the
    # column names
    df.printSchema()
 
    # visualizing the dataframe
    df.show(truncate=False)

Output:



Example 2: Creating New_Country column by getting the substring using substr() function.

Python




if __name__ == "__main__":
 
    # Creating the new column New_Country
    # and store the substring using substr()
    df = df.withColumn("New_Country", df.Country.substr(0, 12))
 
    # printing Dataframe schema to get the
    # column names
    df.printSchema()
 
    # visualizing the dataframe
    df.show(truncate=False)

Output:

Example 3: Using substring() with select() function.

Python




if __name__ == "__main__":
 
    input_data = [("India", +91, "AidanButler"),
                  ("United States of America", +1, "ConerFlores"),
                  ("Israel", +972, "RosseBryant"),
                  ("Dubai", +971, "JuliaSimmon"),
                  ("Russia", 7, "AliceBailey")]
 
    # calling function to create SparkSession
    spark = create_session()
 
    schema = ["Country", "Country Code", "Name"]
 
    # calling function to create dataframe
    df = create_df(spark, input_data, schema)
 
    # Selecting the column using select()
    # function and getting substring
    # using substring()
    df2 = df.select('Name', substring('Name',
                                      1, 5).alias('First Name'),
                    substring('Name', 6, 6).alias('Last Name'))
 
    # printing Dataframe schema to get the column names
    df2.printSchema()
 
    # visualizing the dataframe
    df2.show(truncate=False)

Output:

Example 4: Using substring() with selectExpr() function.

Python




if __name__ == "__main__":
 
    input_data = [("India", +91, "AidanButler"),
                  ("United States of America", +1, "ConerFlores"),
                  ("Israel", +972, "RosseBryant"),
                  ("Dubai", +971, "JuliaSimmon"),
                  ("Russia", 7, "AliceBailey")]
 
    # calling function to create SparkSession
    spark = create_session()
 
    schema = ["Country", "Country Code", "Name"]
 
    # calling function to create dataframe
    df = create_df(spark, input_data, schema)
 
    # Selecting the column using selectExpr()
    # function and getting substring using substring()
    df2 = df.selectExpr('Name', 'substring(Name, 1,5) as First_Name',
                        'substring(Name, 6,6) as Last_Name')
 
    # printing Dataframe schema to get the column names
    df2.printSchema()
 
    # visualizing the dataframe
    df2.show(truncate=False)

Output:

 Attention geek! Strengthen your foundations with the Python Programming Foundation Course and learn the basics.  

To begin with, your interview preparations Enhance your Data Structures concepts with the Python DS Course. And to begin with your Machine Learning Journey, join the Machine Learning – Basic Level Course




My Personal Notes arrow_drop_up
Recommended Articles
Page :