How to check if something is an RDD or a DataFrame in PySpark?
Last Updated: 23 Nov, 2022
In this article, we will check whether some data is an RDD or a DataFrame using the isinstance(), type(), and dispatch methods.
Method 1: Using isinstance()
The isinstance() function checks whether particular data is an RDD or a DataFrame. It returns a boolean value.
Syntax: isinstance(data, DataFrame/RDD)
where
- data is our input data
- DataFrame is the class from the pyspark.sql module
- RDD is the class from the pyspark.rdd module
Example: Program to check whether our data is a DataFrame or not:
Python3
import pyspark
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession

# create the SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# sample data with employee details
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# column names
columns = ['ID', 'NAME', 'Company']

# create the dataframe
dataframe = spark.createDataFrame(data, columns)

# check whether the data is a DataFrame
print(isinstance(dataframe, DataFrame))
Output:
True
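For contrast, the same isinstance() check can be run against RDD on the same object. The snippet below is a minimal follow-up sketch, assuming it runs in the same session and reuses the dataframe variable from the example above:
Python3
from pyspark.rdd import RDD

# a DataFrame is not an RDD, so this check returns False
print(isinstance(dataframe, RDD))

# the DataFrame's underlying RDD, however, is an RDD
print(isinstance(dataframe.rdd, RDD))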
Check whether the data is an RDD or not:
We can check this using the isinstance() method as well.
Syntax: isinstance(data, RDD)
where
- data is our input data
- RDD is the class from the pyspark.rdd module
Example:
Python3
from pyspark.sql import DataFrame
from pyspark.rdd import RDD
from pyspark.sql import SparkSession

# create the SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD with student details
data = spark.sparkContext.parallelize([("1", "sravan", "vignan", 67, 89),
                                       ("2", "ojaswi", "vvit", 78, 89),
                                       ("3", "rohith", "vvit", 100, 80),
                                       ("4", "sridevi", "vignan", 78, 80),
                                       ("1", "sravan", "vignan", 89, 98),
                                       ("5", "gnanesh", "iit", 94, 98)])

# check whether the data is an RDD
print(isinstance(data, RDD))
Output:
True
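As a usage example, isinstance() can drive simple branching. The sketch below is our own illustration (the helper name describe_spark_object is not part of PySpark):
Python3
from pyspark.rdd import RDD
from pyspark.sql import DataFrame, SparkSession

def describe_spark_object(obj):
    # return a label for the kind of object passed in
    if isinstance(obj, DataFrame):
        return "DataFrame"
    if isinstance(obj, RDD):
        return "RDD"
    return type(obj).__name__

spark = SparkSession.builder.getOrCreate()

print(describe_spark_object(spark.createDataFrame([(1, "sravan")], ["ID", "NAME"])))  # DataFrame
print(describe_spark_object(spark.sparkContext.parallelize([(1, "sravan")])))  # RDD
print(describe_spark_object([1, 2, 3]))  # list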
Convert the RDD into a DataFrame and check the type
Here we will create an RDD, convert it to a DataFrame using the toDF() method, and check the type before and after the conversion.
Python3
from pyspark.sql import DataFrame
from pyspark.rdd import RDD
from pyspark.sql import SparkSession

# create the SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD with student details
rdd = spark.sparkContext.parallelize([(1, "Sravan", "vignan", 98),
                                      (2, "bobby", "bsc", 87)])

# check the type before conversion
print("RDD : ", isinstance(rdd, RDD))
print("Dataframe : ", isinstance(rdd, DataFrame))
print("Rdd Data : \n", rdd.collect())

# convert the RDD into a DataFrame
data = rdd.toDF()

# check the type after conversion
print("RDD : ", isinstance(data, RDD))
print("Dataframe : ", isinstance(data, DataFrame))
print(data.collect())
Output:
Method 2: Using type() function
The type() function returns the type of the given object.
Syntax: type(data_object)
where data_object is the RDD or DataFrame.
Example 1: Python program to create an RDD and check its type
Python3
from pyspark.sql import SparkSession

# create the SparkSession
spark = SparkSession.builder.getOrCreate()

# create an RDD and display its type
rdd = spark.sparkContext.parallelize([(1, "Sravan", "vignan", 98),
                                      (2, "bobby", "bsc", 87)])
print(type(rdd))
Output:
<class 'pyspark.rdd.RDD'>
Example 2: Python program to create a DataFrame and check its type.
Python3
import pyspark
from pyspark.sql import SparkSession

# create the SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# sample data with employee details
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# column names
columns = ['ID', 'NAME', 'Company']

# create the dataframe and display its type
dataframe = spark.createDataFrame(data, columns)
print(type(dataframe))
Output:
<class 'pyspark.sql.dataframe.DataFrame'>
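Note that type() reports the exact class, so its result is usually compared with the is operator; unlike isinstance(), it does not match subclasses. Below is a minimal sketch of the difference, assuming a local SparkSession:
Python3
from pyspark.rdd import RDD
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize([(1, "sravan")])
dataframe = spark.createDataFrame([(1, "sravan")], ["ID", "NAME"])

# type() gives the exact class
print(type(rdd) is RDD)              # True
print(type(dataframe) is DataFrame)  # True

# transformations may return an RDD subclass, so the exact-type check can fail
mapped = rdd.map(lambda x: x)
print(type(mapped) is RDD)      # may be False (e.g. a PipelinedRDD)
print(isinstance(mapped, RDD))  # True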
Method 3: Using Dispatch
The @singledispatch decorator from functools creates a generic function that dispatches on the type of its first argument; we can register a separate implementation for each type. Here we create a dispatcher that reports whether our data is an RDD or a DataFrame, so we use single dispatch.
Example 1: Python code to create a single dispatcher, pass the data, and check whether the data is an RDD
Python3
from functools import singledispatch

from pyspark.rdd import RDD
from pyspark.sql import DataFrame, SparkSession

# create the SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# base function: called when no registered type matches
@singledispatch
def check(x):
    pass

# implementation used when the argument is an RDD
@check.register(RDD)
def _(arg):
    return "RDD"

# implementation used when the argument is a DataFrame
@check.register(DataFrame)
def _(arg):
    return "DataFrame"

# create an RDD and check it
print(check(spark.sparkContext.parallelize([("1", "sravan", "vignan", 67, 89)])))
Output:
RDD
Example 2: Python code to check whether the data is a DataFrame
Python3
from functools import singledispatch

from pyspark.rdd import RDD
from pyspark.sql import DataFrame, SparkSession

# create the SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# base function: called when no registered type matches
@singledispatch
def check(x):
    pass

# implementation used when the argument is an RDD
@check.register(RDD)
def _(arg):
    return "RDD"

# implementation used when the argument is a DataFrame
@check.register(DataFrame)
def _(arg):
    return "DataFrame"

# create a dataframe and check it
print(check(spark.createDataFrame([("1", "sravan", "vignan", 67, 89)])))
Output:
DataFrame
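As a follow-up, both dispatch examples can be combined into a single script, with the base function acting as a fallback for objects that are neither type; a minimal sketch:
Python3
from functools import singledispatch

from pyspark.rdd import RDD
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

@singledispatch
def check(x):
    # fallback when the argument is neither an RDD nor a DataFrame
    return "neither RDD nor DataFrame"

@check.register(RDD)
def _(arg):
    return "RDD"

@check.register(DataFrame)
def _(arg):
    return "DataFrame"

print(check(spark.sparkContext.parallelize([("1", "sravan", "vignan", 67, 89)])))  # RDD
print(check(spark.createDataFrame([("1", "sravan", "vignan", 67, 89)])))  # DataFrame
print(check([1, 2, 3]))  # neither RDD nor DataFrame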