Have you ever been stuck with the data of several columns packed into a single column and wondered how to split it apart? This can be achieved in PySpark in several ways. In this article, we will discuss three of them.
Modules Required:
Pyspark: PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for large-scale data processing. This module can be installed through the following command:
pip install pyspark
Methods to split a list into multiple columns in Pyspark:
- Using expr in a list comprehension
- Splitting the data frame row-wise and appending in columns
- Splitting the data frame column-wise
Method 1: Using expr in a list comprehension
Step 1: First of all, import the required libraries, i.e. SparkSession, expr, and the Pyspark data types. The SparkSession library is used to create the session, expr is an SQL function used to execute SQL-like expressions, and pyspark.sql.types holds all the Pyspark data types.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import *
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, define the schema for creating the data frame with an array-typed column.
mySchema = StructType([StructField("Heading", StringType(), True),
                       StructField("Column", ArrayType(IntegerType(), True))])
Step 4: Later on, create the data frame that needs to be split into multiple columns.
data_frame = spark_session.createDataFrame([['column_heading1', [column1_data]],
['column_heading2', [column2_data]]],
schema= mySchema)
Step 5: Finally, split the list into columns using the expr() function in a list comprehension.
data_frame.select([expr('Column[' + str(x) + ']') for x in range(0, number_of_columns)]).show()
Example:
In this example, we define the schema for the data frame, create the data frame in that schema from a list of data, and finally split the array column using the expr function in a list comprehension.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import *

# Create the Spark session
spark_session = SparkSession.builder.getOrCreate()

# Define a schema with an array-typed column
mySchema = StructType([StructField("Heading", StringType(), True),
                       StructField("Column", ArrayType(IntegerType(), True))])

data_frame = spark_session.createDataFrame(
    [['A', [1, 2, 3]], ['B', [4, 5, 6]], ['C', [7, 8, 9]]],
    schema=mySchema)

# Split the array column into one column per element
data_frame.select([expr('Column[' + str(x) + ']') for x in range(0, 3)]).show()
Output:
+---------+---------+---------+
|Column[0]|Column[1]|Column[2]|
+---------+---------+---------+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---------+---------+---------+
Method 2: Splitting the data frame row-wise and appending in columns
Step 1: First of all, import the required libraries, i.e. SparkSession, Row, and col. The SparkSession library is used to create the session, Row is used to represent a row in the data frame, and col is used to represent a column in the data frame.
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import col
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, declare an array that you need to split into multiple columns.
arr=[[row1_data],[row2_data],[row3_data]]
Step 4: Later on, create the data frame with one Row per sub-list of the array.
data_frame = spark_session.createDataFrame([Row(index=1, finalArray = arr[0]),
Row(index=2, finalArray = arr[1]),
Row(index=3, finalArray = arr[2])])
Step 5: Finally, append the columns to the data frame.
data_frame.select([(col("finalArray")[x]).alias("Column "+str(x+1)) for x in range(0, 3)]).show()
Example:
In this example, we declare a list, create a data frame with one row per sub-list, and then select each array element as a separate column for display.
Python3
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import col

spark_session = SparkSession.builder.getOrCreate()

arr = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# One Row per sub-list of arr
data_frame = spark_session.createDataFrame([Row(index=1, finalArray=arr[0]),
                                            Row(index=2, finalArray=arr[1]),
                                            Row(index=3, finalArray=arr[2])])

# Select each array element as its own column
data_frame.select([(col("finalArray")[x]).alias("Column " + str(x + 1))
                   for x in range(0, 3)]).show()
Output:
+--------+--------+--------+
|Column 1|Column 2|Column 3|
+--------+--------+--------+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+--------+--------+--------+
Method 3: Splitting the data frame column-wise
Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, create a spark context.
sc=spark_session.sparkContext
Step 4: Later on, create the data frame that needs to be split into multiple columns.
data_frame = spark_session.createDataFrame(sc.parallelize([['column_heading1', [column1_data]],
['column_heading2', [column2_data]]]),
["key", "value"])
Step 5: Finally, split the data frame column-wise.
data_frame.select("key", data_frame.value[0], data_frame.value[1], data_frame.value[2]).show()
Example:
In this example, we create an RDD from the list using the Spark context, build a data frame from it, and then split the array column into multiple columns for display.
Python3
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()
sc = spark_session.sparkContext

# Build the data frame from a parallelized list
data_frame = spark_session.createDataFrame(
    sc.parallelize([['Column 1', [1, 2, 3]],
                    ['Column 2', [4, 5, 6]],
                    ['Column 3', [7, 8, 9]]]), ["key", "value"])

# Select the key plus each element of the array column
data_frame.select("key", data_frame.value[0], data_frame.value[1],
                  data_frame.value[2]).show()
Output:
+--------+--------+--------+--------+
| key|value[0]|value[1]|value[2]|
+--------+--------+--------+--------+
|Column 1| 1| 2| 3|
|Column 2| 4| 5| 6|
|Column 3| 7| 8| 9|
+--------+--------+--------+--------+