The pyspark.sql.functions module provides a split() function, which splits a DataFrame string column into an array column. The elements of that array can then be extracted into multiple columns.
Syntax: pyspark.sql.functions.split(str, pattern, limit=-1)
- str: A Column or column name (str) holding the string expression to split.
- pattern: A str representing the regular expression to split on. This should be a Java regular expression.
- limit: An optional int (default -1) that controls the number of times the pattern is applied.
- limit > 0: The resulting array has at most limit elements, and its last element contains all input beyond the last matched pattern.
- limit <= 0: The pattern is applied as many times as possible, and the resulting array can be of any size.
First, let's create a DataFrame.
Example 1: Split column using withColumn()
In this example, we create a simple DataFrame with the column ‘DOB’, which contains the date of birth as a string in yyyy-mm-dd format. Using split() and withColumn(), the column is split into year, month, and day columns.
Alternatively, we can write it like this, which gives the same output:
In the above example we have used two parameters of split(): ‘str’, which is the column to split, and ‘pattern’, which is the delimiter (here a hyphen) at which each value is split.
Example 2: Split column using select()
In this example we will use the same DataFrame df and split its ‘DOB’ column using .select():
In the above example, we have not selected the ‘Gender’ column in select(), so it is not visible in resultant df3.
Example 3: Splitting another string column
In the above example, we have taken only two columns, First Name and Last Name, and split the Last Name values into single characters, each placed in its own column.