Open In App

Spark Trim String Column on DataFrame

Last Updated : 29 Sep, 2023
Improve
Improve
Like Article
Like
Save
Share
Report

In this article, we will see that in PySpark, we can remove white spaces in the DataFrame string column. Here we will perform a similar operation to trim() (removes left and right white spaces) present in SQL in PySpark itself.

PySpark Trim String Column on DataFrame

Below are the ways by which we can trim String Column on DataFrame in PySpark:

  • Using withColumn with rtrim()
  • Using withColumn with trim()
  • Using select()
  • Using SQL Expression
  • Using PySpark trim(), rtrim(), ltrim()

PySpark Trim using withColumn() with rtrim()

In this example, we are using withColumn() with rtrim() to remove white spaces.

Python3




from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col
 
# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()
 
# Define your data
data = [(1, "ABC    "), (2, "     DEF"), (3, "        GHI    ")]
 
# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])
 
# Show the initial DataFrame
df.show()
 
# Using withColumn to remove white spaces
df = df.withColumn("col2", rtrim(col("col2")))
df.show()
 
# Stop the Spark session
spark.stop()


+----+----------+
|col1|      col2|
+----+----------+
|   1|     ABC    |
|   2|     DEF    |
|   3|        GHI|
+----+----------+
+----+---+
|col1|col2|
+----+---+
|   1|ABC|
|   2|DEF|
|   3|GHI|
+----+---+

PySpark Trim using withColumn with trim()

In this example, we are using withColumn() with trim() to remove white spaces.

Python3




from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col
 
# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()
 
# Define your data
data = [(1, "ABC    "), (2, "     DEF"), (3, "        GHI    ")]
 
# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])
 
# Show the initial DataFrame
df.show()
 
# Using withColumn to remove white spaces
df = df.withColumn("col2", trim(col("col2")))
df.show()
 
# Stop the Spark session
spark.stop()


+----+----------+
|col1|      col2|
+----+----------+
|   1|     ABC    |
|   2|     DEF    |
|   3|        GHI|
+----+----------+
+----+---+
|col1|col2|
+----+---+
|   1|ABC|
|   2|DEF|
|   3|GHI|
+----+---+

PySpark Trim using select() to Remove White Spaces

In this example, we are using select() to trim in PySpark.

Python3




from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col
 
# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()
 
# Define your data
data = [(1, "ABC    "), (2, "     DEF"), (3, "        GHI    ")]
 
# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])
 
# Show the initial DataFrame
df.show()
 
# Using select to remove white spaces
df = df.select(col("col1"), trim(col("col2")).alias("col2"))
df.show()
 
# Stop the Spark session
spark.stop()


+----+----------+
|col1|      col2|
+----+----------+
|   1|     ABC    |
|   2|     DEF    |
|   3|        GHI|
+----+----------+
+----+---+
|col1|col2|
+----+---+
|   1|ABC|
|   2|DEF|
|   3|GHI|
+----+---+

Using SQL Expression

In this example, we are using SQL Expression to remove extra whitespaces.

Python3




from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col
 
# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()
 
# Define your data
data = [(1, "ABC    "), (2, "     DEF"), (3, "        GHI    ")]
 
# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])
 
# Show the initial DataFrame
df.show()
 
# Using SQL Expression to remove white spaces
df.createOrReplaceTempView("TAB")
spark.sql("SELECT col1, TRIM(col2) AS col2 FROM TAB").show()
 
# Stop the Spark session
spark.stop()


+----+----------+
|col1|      col2|
+----+----------+
|   1|     ABC    |
|   2|     DEF    |
|   3|        GHI|
+----+----------+
+----+---+
|col1|col2|
+----+---+
|   1|ABC|
|   2|DEF|
|   3|GHI|
+----+---+

Using PySpark trim(), rtrim(), ltrim()

In PySpark, we can easily remove whitespaces or trim by using pyspark.sql.functions.trim() function. We use ltrim() to remove only left white spaces and rtrim() function to remove right white spaces.

Python3




from pyspark.sql.functions import trim, ltrim, rtrim
data = [(1, "ABC    "), (2, "     DEF"), (3, "        GHI    ")]
df = spark.createDataFrame(data=data, schema=["col1", "col2"])
df.show()
 
 
# using withColumn and trim()
df.withColumn("col2", trim("col2")).show()
 
# using ltrim()
df.withColumn("col2", ltrim("col2")).show()
 
# using rtrim()
df.withColumn("col2", rtrim("col2")).show()
 
# Using select
df.select("col1", trim("col2").alias('col2')).show()
 
# Using SQL Expression
df.createOrReplaceTempView("TAB")
spark.sql("select col1,trim(col2) as col2 from TAB").show()


+----+----------+
|col1|      col2|
+----+----------+
|   1|     ABC    |
|   2|     DEF    |
|   3|        GHI|
+----+----------+
+----+---+
|col1|col2|
+----+---+
|   1|ABC|
|   2|DEF|
|   3|GHI|
+----+---+
+----+---+
|col1|col2|
+----+---+
|   1|ABC|
|   2|DEF|
|   3|GHI|
+----+---+
+----+---+
|col1|col2|
+----+---+
|   1|ABC|
|   2|DEF|
|   3|GHI|
+----+---+
+----+---+
|col1|col2|
+----+---+
|   1|ABC|
|   2|DEF|
|   3|GHI|
+----+---+


Like Article
Suggest improvement
Share your thoughts in the comments

Similar Reads