Using PySpark Row on DataFrame and RDD

Last Updated : 16 Oct, 2023

You can access the fields of a row either as attributes (row.name) or by key, the way you would look up a dictionary value (row['name']). Row lets you create row objects using named arguments. A named argument cannot simply be omitted to indicate that the value is "none" or does not exist; in that case, you should explicitly set the field to None.

A change in version 3.0.0: Rows created from named arguments are now ordered by the position in which the arguments were entered, instead of alphabetically by field name. A Row in PySpark is an immutable, dynamically typed object containing a set of key-value pairs, where the keys correspond to the names of the columns in the DataFrame.

Rows can be created in a number of ways, including directly instantiating a Row object with a range of values, or converting an RDD of tuples to a DataFrame. In PySpark, DataFrames are built on top of RDDs but provide a more structured and streamlined way to manipulate data using SQL-like queries and transformations. In this context, a Row object represents a record in a DataFrame or an element in an RDD of tuples.
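
As a quick sketch of these access patterns (the field values here are made up for illustration), a Row behaves like both an object and a dictionary, preserves the order in which its fields were passed, and requires missing values to be set to None explicitly:

Python3

from pyspark.sql import Row

# Create a Row; since Spark 3.0.0 the fields keep the order they were passed in
row = Row(name='GeeksForGeeks', age=None)

# Attribute-style and dictionary-style access return the same value
print(row.name)      # GeeksForGeeks
print(row['name'])   # GeeksForGeeks

# A missing value must be set explicitly to None
print(row.age)       # None

# Convert the Row to a plain Python dictionary
print(row.asDict())  # {'name': 'GeeksForGeeks', 'age': None}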

1. Creating a Row object in PySpark

Approach:

  • Import Row from pyspark.sql
  • Create a row using Row()
  • Access the fields of the row using attribute notation (e.g. row.name).

Python3

from pyspark.sql import Row

# Create a Row object with three columns: name, age, and city
row = Row(name='GeeksForGeeks', age=25, city='India')

# Access the values of the row using dot notation
print(row.name)
print(row.age)
print(row.city)

Output:

GeeksForGeeks
25
India

2. Creating a DataFrame using Row in PySpark

You can also create a DataFrame by combining Row objects with an explicit schema, which is a set of column names and data types.

  • Import StructType, StructField, StringType, IntegerType from pyspark.sql.types, and SparkSession and Row from pyspark.sql
  • Create a SparkSession and define the schema using StructType and StructField
  • Create a list of Row objects to build the DataFrame
  • Display the contents of the DataFrame using show().

Python3

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession, Row

# Creating a SparkSession
spark = SparkSession.builder.appName("Dataframe using Row example").getOrCreate()

# Define the schema for a DataFrame with three columns: name, age, and city
schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False),
    StructField('city', StringType(), nullable=False)
])

# Create a list of Row objects
rows = [
    Row(name='John', age=30, city='New York'),
    Row(name='Mary', age=25, city='Los Angeles'),
    Row(name='Bob', age=35, city='Chicago')
]

# Create a DataFrame from the rows and schema
df = spark.createDataFrame(rows, schema)

# Display the contents of the DataFrame
df.show()

Output:

+----+---+-----------+
|name|age|       city|
+----+---+-----------+
|John| 30|   New York|
|Mary| 25|Los Angeles|
| Bob| 35|    Chicago|
+----+---+-----------+
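
The explicit schema above is optional. As a minimal sketch (reusing the rows list from the example), createDataFrame() can also infer the column names and types directly from the Row objects:

Python3

# Schema inference: the column names come from the Row fields,
# and the types are inferred from the values
df_inferred = spark.createDataFrame(rows)
df_inferred.printSchema()
df_inferred.show()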


3. RDD operations on a Row DataFrame in PySpark

Here we use the above DataFrame as input. Rows can also be transformed in various ways using RDD operations such as map() and filter(). Below is an example that uses an RDD of rows to calculate the average age of people in different cities.

  • Convert the dataframe into RDD of rows.
  • Filter the rows by city using filter(), which will filter the rows based on the condition provided.
  • Then group the rows by city and compute the average age as sum()/len().
  • Print the results.

Python3

# Convert the DataFrame to an RDD of Rows
rdd = df.rdd

# Filter the rows by city
rdd_city = rdd.filter(lambda row: row.city == 'New York' or row.city == 'Chicago')

# Compute the average age for each city
rdd_avg_age = (
    rdd_city.map(lambda row: (row.city, row.age))
    .groupByKey()
    .mapValues(lambda ages: sum(ages) / len(ages))
)

# Display the results
print(rdd_avg_age.collect())

Output:

[('New York', 30.0), ('Chicago', 35.0)]
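
For comparison, the same aggregation can be written with the DataFrame API instead of RDD operations. This is a minimal sketch that assumes the df created in section 2:

Python3

from pyspark.sql import functions as F

# Same computation using DataFrame operations instead of the RDD API
df.filter(df.city.isin('New York', 'Chicago')) \
  .groupBy('city') \
  .agg(F.avg('age').alias('avg_age')) \
  .show()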


4. Creating a custom class from Row in PySpark

You can also create a Row-like class, such as Student, and use it like a Row object. This is useful when you want record-like objects with meaningful names.

Approach:

  1. Create a class with Row.
  2. Instantiate an object of your class with the desired value.
  3. Access an attribute of an object to get its value. 

Python3

from pyspark.sql import Row

# Creating a Row-like class using Row
Student = Row("name", "age", "sex")

# Inserting data into objects of the class
s1 = Student("ajay", 24, 'Male')
s2 = Student("sesh", 26, 'Male')

# Printing the object values
print(s1.name + ',' + s2.name)

Output:

ajay,sesh
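
Because Student instances are ordinary Row objects, they can be used anywhere a Row is expected. A small sketch (reusing s1, s2, and the SparkSession from the earlier examples) turns them into a DataFrame:

Python3

# Student rows can be used directly to build a DataFrame;
# the column names come from the Row class definition
students_df = spark.createDataFrame([s1, s2])
students_df.show()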

5. Example on RDD and DataFrame using Row in PySpark

Approach:

  1. Create a SparkSession object using SparkSession.builder.appName("RowExample").getOrCreate().
  2. Define your data. In this case, data is a list of tuples representing employee information.
  3. Create a list of Row objects from your data. This can be done using a list comprehension and the Row class as follows: rows = [Row(*line) for line in data].
  4. Create a DataFrame header. This is a list of strings representing the names of the columns in the DataFrame.
  5. Create a DataFrame from the Row object and headers using the createDataFrame() method of the SparkSession object. The syntax is spark.createDataFrame(rows, header).
  6. Display the contents of the DataFrame using the show() method.
  7. Add a new column to the DataFrame using a list comprehension and the Row class. Here we add a "bonus" column equal to 10% of the salary. The syntax is new_rows = [Row(*row, row[4] * 0.1) for row in rows].
  8. Add a new header with the new column names. Create a new DataFrame using the createDataFrame() method.
  9. Display the contents of the new DataFrame using the show() method.
  10. Calculate the average age of male and female employees using the RDD and Row objects.
  11. Use the rdd attribute of the DataFrame object to convert it to an RDD.
  12. Extract the Gender and Age columns using the map() method of the RDD object.
  13. Aggregate the ages of male and female employees separately using the reduceByKey() method.
  14. To count the number of male and female employees separately, use the map() method.
  15. Aggregate the number of male and female employees separately using the reduceByKey() method.
  16. Join two RDDs using the join() method. Calculate the average age of male and female employees using the mapValues() method.
  17. Create a list of row objects containing gender and average age.
  18. Create a DataFrame from a list of Row objects using the createDataFrame() method.
  19. Use the show() method to display the average age of male and female employees. 

Python3

# Example to demonstrate Rows using rdd and dataframe
from pyspark.sql import SparkSession, Row

# Create a SparkSession
spark = SparkSession.builder.appName("RowExample").getOrCreate()

# Create the data as a list of tuples
data = [(1, "John", "M", 32, 45000), (2, "Jane", "M", 35, 65000),
        (3, "Bob", "M", 30, 60000), (4, "Alice", "M", 20, 25000),
        (5, "shreya", "F", 26, 45000)]

# Create a list of Row objects from the data
rows = [Row(*line) for line in data]

# Column names for the DataFrame
header = ["id", "name", "gender", "age", "salary"]

# Create a DataFrame from the rows and the header
df = spark.createDataFrame(rows, header)

# Display the contents of the DataFrame
df.show()

# Add a new column to the DataFrame using Row objects
new_rows = [Row(*row, row[4] * 0.1) for row in rows]

# Adding new header
new_header = header + ["bonus"]

# Creating new dataframe
new_df = spark.createDataFrame(new_rows, new_header)

# Display the contents of the new DataFrame
new_df.show()

# Compute the average age of male and female employees using RDDs and Row objects
gender_age_rdd = df.rdd.map(lambda row: (row.gender, int(row.age)))
gender_age_sum_rdd = gender_age_rdd.reduceByKey(lambda x, y: x + y)
gender_count_rdd = gender_age_rdd.map(lambda x: (x[0], 1)) \
                                 .reduceByKey(lambda x, y: x + y)
gender_avg_age_rdd = gender_age_sum_rdd.join(gender_count_rdd) \
                                       .mapValues(lambda x: x[0] / x[1])
gender_avg_age_rows = [Row(gender=gender, avg_age=avg_age)
                       for gender, avg_age in gender_avg_age_rdd.collect()]
gender_avg_age_df = spark.createDataFrame(gender_avg_age_rows)

# Display the average age of male and female employees
gender_avg_age_df.show()


Output:

+--+------+------+---+------+
|id|  name|gender|age|salary|
+--+------+------+---+------+
| 1|  John|     M| 32| 45000|
| 2|  Jane|     M| 35| 65000|
| 3|   Bob|     M| 30| 60000|
| 4| Alice|     M| 20| 25000|
| 5|shreya|     F| 26| 45000|
+--+------+------+---+------+

+--+------+------+---+------+------+
|id|  name|gender|age|salary| bonus|
+--+------+------+---+------+------+
| 1|  John|     M| 32| 45000|4500.0|
| 2|  Jane|     M| 35| 65000|6500.0|
| 3|   Bob|     M| 30| 60000|6000.0|
| 4| Alice|     M| 20| 25000|2500.0|
| 5|shreya|     F| 26| 45000|4500.0|
+--+------+------+---+------+------+

+------+-------+
|gender|avg_age|
+------+-------+
|     M|  29.25|
|     F|   26.0|
+------+-------+


