Rename Nested Field in Spark Dataframe in Python
Last Updated: 21 Mar, 2023
In this article, we will discuss how to rename columns in a Spark DataFrame, including fields nested inside a struct column. Top-level columns can be renamed with the withColumnRenamed method, which takes the existing column name and the new name and returns a new DataFrame with the column renamed. withColumnRenamed does not reach inside struct columns, so a nested field is renamed with select and a cast of the struct column to a schema that carries the new field names.
Required Package
PySpark is the Python library for Spark programming. It lets you work with a Spark cluster from Python and run distributed computations on large datasets using the Spark engine. In a notebook, you can install PySpark using the following command (drop the leading ! when running it in a terminal):
!pip install pyspark
Rename Field in Spark DataFrame
You can use the withColumnRenamed method to rename a top-level field in a Spark DataFrame. For example, if you have a DataFrame called df and you want to rename the field "oldFieldName" to "newFieldName", you can use the following code structure. The call returns a new DataFrame, so assign the result to a variable:
df.withColumnRenamed("oldFieldName", "newFieldName")
Create the Spark DataFrame.
Python3
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("CreateDF").getOrCreate()

# Sample rows: id, first name, last name, age
data = [(1, "John", "a", 25), (2, "Mike", "b", 30), (3, "Sara", "c", 35)]
df = spark.createDataFrame(data, ["id", "fname", "lname", "age"])
df.printSchema()
Output:
root
|-- id: long (nullable = true)
|-- fname: string (nullable = true)
|-- lname: string (nullable = true)
|-- age: long (nullable = true)
Rename a single column by passing the old field name and the new field name to withColumnRenamed.
Python3
# Rename "fname" to "FirstName"; df itself is left unchanged
df1 = df.withColumnRenamed("fname", "FirstName")
df1.printSchema()
Output:
root
|-- id: long (nullable = true)
|-- FirstName: string (nullable = true)
|-- lname: string (nullable = true)
|-- age: long (nullable = true)
To rename multiple columns, chain calls to withColumnRenamed, one call per column.
Python3
# Each call returns a new DataFrame, so the calls can be chained
df2 = (
    df.withColumnRenamed("fname", "FirstName")
      .withColumnRenamed("lname", "LastName")
)
df2.printSchema()
Output:
root
|-- id: long (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
|-- age: long (nullable = true)
Rename Nested Field in Spark DataFrame
withColumnRenamed only sees top-level columns, so to rename fields inside a nested (struct) column we have to redefine the structure of that column. First, define a schema for the struct that uses the new field names, then cast the column to that schema using the following code structure:
df.select(col("address").cast(struct_schema)).printSchema()
Create the DataFrame.
Python3
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# "address" is a struct column with three nested fields
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("address", StructType([
        StructField("street", StringType()),
        StructField("city", StringType()),
        StructField("zip", IntegerType())
    ]))
])
data = [("Alice", 25, {"street": "Main St", "city": "Anytown", "zip": 12345}),
        ("Bob", 30, {"street": "Park Ave", "city": "New York", "zip": 56789})]
df = spark.createDataFrame(data, schema)
# truncate=False prints the full struct instead of cutting it at 20 characters
df.show(truncate=False)
df.printSchema()
Output:
+-----+---+---------------------------+
|name |age|address                    |
+-----+---+---------------------------+
|Alice|25 |{Main St, Anytown, 12345}  |
|Bob  |30 |{Park Ave, New York, 56789}|
+-----+---+---------------------------+
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- address: struct (nullable = true)
|    |-- street: string (nullable = true)
|    |-- city: string (nullable = true)
|    |-- zip: integer (nullable = true)
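To rename the nested fields, redefine the structure of the struct column: define a schema that lists each new field name with its data type, then cast the column to that schema. The cast matches fields by position, not by name, so the new schema must list the fields in the same order as the original struct. Note that the select here returns only the address column.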
Python3
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import col

# New field names, in the same order and with the same types as the original struct
struct_schema = StructType([
    StructField("Street_name", StringType()),
    StructField("city_name", StringType()),
    StructField("Zip_code", IntegerType())
])
# Cast the struct column to the new schema; only "address" is selected
df.select(col("address").cast(struct_schema)).printSchema()
Output:
root
|-- address: struct (nullable = true)
|    |-- Street_name: string (nullable = true)
|    |-- city_name: string (nullable = true)
|    |-- Zip_code: integer (nullable = true)