How to slice a PySpark DataFrame into two row-wise DataFrames?
Last Updated : 26 Jan, 2022
In this article, we are going to learn how to slice a PySpark DataFrame row-wise into two DataFrames. Slicing a DataFrame means getting a subset containing all rows from one index to another.
Method 1: Using limit() and subtract() functions
In this method, we first make a PySpark DataFrame from hard-coded data using createDataFrame(). We then use the limit() function to get a given number of rows from the DataFrame and store them in a new variable. Without an explicit ordering, Spark does not guarantee which rows limit() returns. The syntax of the limit function is :
Syntax : DataFrame.limit(num)
Returns : A DataFrame with at most num rows.
We then use the subtract() function to get the remaining rows from the initial DataFrame. The syntax of the subtract function is :
Syntax : DataFrame1.subtract(DataFrame2)
Returns : A new DataFrame containing the distinct rows that are in DataFrame1 but not in DataFrame2. This is equivalent to EXCEPT DISTINCT in SQL, so duplicate rows are not preserved.
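Because subtract() works like a set difference, the limit() and subtract() combination can lose rows when the DataFrame contains exact duplicates. The sketch below is a plain-Python model of that behaviour, not Spark code: rows are tuples, and the helpers `limit` and `subtract` are hypothetical stand-ins written only for this illustration. It also assumes limit() returns the first rows, which Spark does not promise.

```python
# Plain-Python model of limit() followed by subtract(); no Spark required.
# The helper names below are hypothetical and only mimic the semantics.

def limit(rows, num):
    """Model of DataFrame.limit(num): keep at most num rows."""
    return rows[:num]


def subtract(left, right):
    """Model of DataFrame.subtract: distinct rows of left that are absent from right."""
    exclude = set(right)
    seen = set()
    result = []
    for row in left:
        if row not in exclude and row not in seen:
            seen.add(row)
            result.append(row)
    return result


rows = [
    ('Lee Chong Wei', 69, 'Malaysia'),
    ('Lin Dan', 66, 'China'),
    ('Lin Dan', 66, 'China'),        # exact duplicate of the previous row
    ('Kento Momota', 15, 'Japan'),
    ('Kento Momota', 15, 'Japan'),   # exact duplicate of the previous row
]

first = limit(rows, 2)
rest = subtract(rows, first)

# The two slices together hold fewer rows than the input:
# the duplicates were removed by the set difference.
print(len(rows), len(first) + len(rest))
```

If every row is unique, as in the example data used in this article, the two slices together contain all the original rows.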
Python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
Spark_Session = SparkSession.builder.appName(
    'Spark Session'
).getOrCreate()

# Sample data: one row per player
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]
columns = ['Player', 'Titles', 'Country']

df = Spark_Session.createDataFrame(rows, columns)

# First slice: 3 rows of the DataFrame
df1 = df.limit(3)

# Second slice: the rows of df that are not in df1
df2 = df.subtract(df1)

df1.show()
df2.show()
Method 2: Using randomSplit() function
In this method, we first make a PySpark DataFrame using createDataFrame(). We then use the randomSplit() function to get two slices of the DataFrame, specifying the approximate fraction of rows that goes into each slice. The rows are assigned to the slices randomly, so the slice sizes are not guaranteed to match the weights exactly, and on a very small DataFrame one slice may even be empty.
Syntax : DataFrame.randomSplit(weights, seed=None)
Parameters :
- weights : list of double values used as weights for the split. The weights are normalized if they do not sum to 1.0.
- seed : the seed for sampling. This parameter is optional; fixing it helps make the split repeatable.
Returns : List of split DataFrames
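To see why the slice sizes only approximate the weights, it can help to model the split in plain Python. The sketch below is a rough model under stated assumptions, not Spark's implementation: the hypothetical helper `random_split` normalizes the weights, turns them into cumulative bounds, and sends each row to the slice whose range contains a uniform random draw.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed=None):
    """Rough model of a weighted random split (hypothetical helper, not Spark code)."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.2, 1.0]
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform draw in [0, 1)
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound
            splits[-1].append(row)
    return splits


rows = [
    ('Lee Chong Wei', 69, 'Malaysia'),
    ('Lin Dan', 66, 'China'),
    ('Srikanth Kidambi', 9, 'India'),
    ('Kento Momota', 15, 'Japan'),
]

small, large = random_split(rows, [0.20, 0.80], seed=7)
print(len(small), len(large))
```

Each row lands in exactly one slice, so the slices never overlap, but with only four rows the sizes can differ noticeably from 20% and 80%. In this model, weights of [1, 4] behave the same as [0.20, 0.80] because both normalize to the same bounds.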
Python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
Spark_Session = SparkSession.builder.appName(
    'Spark Session'
).getOrCreate()

# Sample data: one row per player
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]
columns = ['Player', 'Titles', 'Country']

df = Spark_Session.createDataFrame(rows, columns)

# Randomly split the rows into roughly 20% and 80%
df1, df2 = df.randomSplit([0.20, 0.80])

df1.show()
df2.show()
Method 3: Using collect() function
In this method, we first make a PySpark DataFrame using createDataFrame(). We then get a list of Row objects of the DataFrame using :
DataFrame.collect()
Note that collect() brings every row to the driver, so this approach only suits DataFrames small enough to fit in the driver's memory. We then use Python list slicing to get two lists of Rows. Finally, we convert these two lists of Rows to PySpark DataFrames using createDataFrame().
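The slicing step itself is ordinary Python list slicing. As a minimal sketch, the example below uses plain tuples in place of Row objects, and `split_at` is a hypothetical helper introduced only for illustration.

```python
def split_at(row_list, n):
    """Split a list into the first n items and the remaining items."""
    # row_list[:n] stops before index n, and row_list[n:] starts at index n,
    # so the two parts never overlap and together cover every item.
    return row_list[:n], row_list[n:]


row_list = [
    ('Lee Chong Wei', 69, 'Malaysia'),
    ('Lin Dan', 66, 'China'),
    ('Srikanth Kidambi', 9, 'India'),
    ('Kento Momota', 15, 'Japan'),
]

part1, part2 = split_at(row_list, 1)
print(len(part1), len(part2))

# Choosing n = 0 (or n >= len(row_list)) leaves one part empty.
empty, everything = split_at(row_list, 0)
print(len(empty), len(everything))
```

An empty part is worth guarding against: an empty list gives createDataFrame() nothing to infer a schema from, so in that case passing the original schema explicitly is likely needed.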
Python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
Spark_Session = SparkSession.builder.appName(
    'Spark Session'
).getOrCreate()

# Sample data: one row per player
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]
columns = ['Player', 'Titles', 'Country']

df = Spark_Session.createDataFrame(rows, columns)

# Bring all rows to the driver as a list of Row objects
row_list = df.collect()

# Slice the list: the first row, then the remaining rows
part1 = row_list[:1]
part2 = row_list[1:]

# Convert each list of Rows back to a PySpark DataFrame
slice1 = Spark_Session.createDataFrame(part1)
slice2 = Spark_Session.createDataFrame(part2)

print('First DataFrame')
slice1.show()
print('Second DataFrame')
slice2.show()
Method 4: Converting PySpark DataFrame to a Pandas DataFrame and using iloc[] for slicing
In this method, we first make a PySpark DataFrame using createDataFrame(). We then convert it into a Pandas DataFrame using toPandas(). Like collect(), toPandas() loads all the data onto the driver, so it should only be used on small DataFrames. We then slice the Pandas DataFrame using iloc[] with the syntax :
DataFrame.iloc[start_index:end_index]
The row at end_index is not included. Finally, we convert the two slices back to PySpark DataFrames using createDataFrame().
Python
from pyspark.sql import SparkSession
import pandas as pd

# Create (or reuse) a Spark session
Spark_Session = SparkSession.builder.appName(
    'Spark Session'
).getOrCreate()

# Sample data: one row per player
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]
columns = ['Player', 'Titles', 'Country']

df = Spark_Session.createDataFrame(rows, columns)

# Convert to a Pandas DataFrame (loads all rows onto the driver)
pandas_df = df.toPandas()

# Slice by position: rows 0-1, then rows 2 onwards
df1 = pandas_df.iloc[:2]
df2 = pandas_df.iloc[2:]

# Convert the slices back to PySpark DataFrames
df1 = Spark_Session.createDataFrame(df1)
df2 = Spark_Session.createDataFrame(df2)

print('First DataFrame')
df1.show()
print('Second DataFrame')
df2.show()