Split single column into multiple columns in PySpark DataFrame
pyspark.sql.functions provides a split() function, which is used to split a DataFrame string column into multiple columns.
Syntax: pyspark.sql.functions.split(str, pattern, limit=-1)
Parameters:
- str: the Column (or column name) of string type to split.
- pattern: a str parameter, a string representing a regular expression. This should be a Java regular expression.
- limit: an optional int parameter that controls the number of times the pattern is applied.
- limit > 0: the resulting array's length will be at most limit, and the last element contains everything past the final matched pattern.
- limit <= 0: the pattern is applied as many times as possible, and the resulting array can be of any length.
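Spark's split() follows Java's String.split() semantics. As a rough pure-Python sketch of how limit shapes the result, Python's re.split behaves similarly, except that its maxsplit counts splits rather than elements, so maxsplit = limit - 1 gives the same shape:

```python
import re

s = '1991-04-01'

# limit <= 0 (Spark's default): the pattern is applied as many times as possible
print(re.split('-', s))              # ['1991', '04', '01']

# limit = 2 in Spark keeps at most 2 elements; the equivalent here is
# maxsplit = 1, leaving the remainder in the last element
print(re.split('-', s, maxsplit=1))  # ['1991', '04-01']
```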
First, let's create a DataFrame.
Python3
!pip install pyspark

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Create a SparkSession
spark = SparkSession.builder.appName("sparkdf").getOrCreate()

data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', 'M', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')]

columns = ['First Name', 'Last Name', 'Gender', 'DOB']

df = spark.createDataFrame(data, columns)
df.printSchema()
df.show()
Output:
DataFrame created
Example 1: Split column using withColumn()
In this example, we create a simple DataFrame with the column 'DOB', which contains the date of birth as a string in yyyy-mm-dd format. Using split() together with withColumn(), the column is split into Year, Month, and Day columns.
Python3
# Split the DOB column on the '-' delimiter
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

df1 = df.withColumn('Year', split_cols.getItem(0)) \
        .withColumn('Month', split_cols.getItem(1)) \
        .withColumn('Day', split_cols.getItem(2))
df1.show()
Output:
Dataframe after splitting columns
Alternatively, we can write it as follows; it gives the same output:
Python3
df2 = df.withColumn('Year', split(df['DOB'], '-').getItem(0)) \
        .withColumn('Month', split(df['DOB'], '-').getItem(1)) \
        .withColumn('Day', split(df['DOB'], '-').getItem(2))
df2.show()
Output:
In the above example we used two parameters of split(): 'str', the column to split, and 'pattern', a regular expression matching the delimiter at which to split the data.
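Because the pattern is a regular expression, a single split() call can match several different delimiters at once, e.g. a character class such as '[-/]'. As a minimal pure-Python sketch of the same regex semantics (Spark's split(df['DOB'], '[-/]') would behave analogously on the column):

```python
import re

# A character class matches either delimiter, so mixed-delimiter
# dates split cleanly in one pass.
print(re.split('[-/]', '1991/04-01'))  # ['1991', '04', '01']
```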
Example 2: Split column using select()
In this example, we will build the same DataFrame and split its 'DOB' column using select():
Python3
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', 'M', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')]

columns = ['First Name', 'Last Name', 'Gender', 'DOB']

df = spark.createDataFrame(data, columns)
df.printSchema()
df.show()

split_cols = pyspark.sql.functions.split(df['DOB'], '-')

# 'Gender' is deliberately left out of the select(), so it is dropped
df3 = df.select('First Name', 'Last Name', 'DOB',
                split_cols.getItem(0).alias('year'),
                split_cols.getItem(1).alias('month'),
                split_cols.getItem(2).alias('day'))
df3.show()
Output:
In the above example, we did not include the 'Gender' column in select(), so it is not visible in the resulting df3.
Example 3: Splitting another string column
Python3
data = [('Jaya', 'Sinha'), ('Milan', 'Soni'),
        ('Rohit', 'Verma'), ('Maria', 'Anne'),
        ('Jay', 'Mehta')]

columns = ['First Name', 'Last Name']

df = spark.createDataFrame(data, columns)
df.printSchema()
df.show()

# An empty pattern splits the string into individual characters
split_cols = pyspark.sql.functions.split(df['Last Name'], '')

df = df.withColumn('1', split_cols.getItem(0)) \
       .withColumn('2', split_cols.getItem(1)) \
       .withColumn('3', split_cols.getItem(2)) \
       .withColumn('4', split_cols.getItem(3)) \
       .withColumn('5', split_cols.getItem(4))
df.show()
Output:
In the above example, we took only two columns, First Name and Last Name, and split the Last Name column's values into single characters residing in multiple columns. Note that getItem() returns null when the index is beyond the array's length, so four-letter names leave column '5' null.
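The per-character behavior can be illustrated without Spark; this is a pure-Python analogy (char_at is a hypothetical helper, not a PySpark API), where returning None stands in for the null that getItem() yields past the end of the array:

```python
def char_at(name, i):
    """Return the i-th character, or None (like Spark's null) if absent."""
    return name[i] if i < len(name) else None

# 'Anne' has only four characters, so index 4 yields None,
# just as getItem(4) yields null for that row in the example above.
print([char_at('Anne', i) for i in range(5)])
# ['A', 'n', 'n', 'e', None]
```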
Last Updated: 09 May, 2021