
Partition of Timestamp column in Dataframes Pyspark

Last Updated : 02 Jan, 2023

In this article, we are going to learn how to partition a timestamp column in data frames using PySpark in Python.

A timestamp column contains various time fields, such as year, month, week, day, hour, minute, second, and millisecond. In many circumstances we don't want all of this information for our processing; we need just one or two fields. In such situations, we can make use of the date_format() function.

date_format() function:

The date_format() function extracts only the information we need from a column, such as the hour, minute, second, date, day, month, or year, and returns it as a new column that we can use further in the data frame.

Syntax: date_format(date, format)

Parameters:

  • date: The timestamp or timestamp column in the data frame that needs to be partitioned.
  • format: The specific format in which you want to partition the column, such as 'yyyy', 'MM', 'dd', 'E', etc. Note that Spark's pattern letters are case-sensitive, so 'MM' (month) and 'mm' (minute) mean different things.
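To build intuition for what each pattern extracts, here is a minimal pure-Python sketch using the standard library's datetime.strftime. This is only an analogy: Spark's date_format() pattern letters differ from strftime directives, as the comments note, and the sample timestamp is a made-up value for illustration.

```python
from datetime import datetime

# A sample timestamp (hypothetical value, chosen only for illustration)
example_ts = datetime(2021, 12, 23, 9, 57, 58)

# Spark date_format() pattern vs. the roughly equivalent strftime directive:
#   Spark "yyyy" ~ strftime "%Y"  -> year
#   Spark "MM"   ~ strftime "%m"  -> month number
#   Spark "dd"   ~ strftime "%d"  -> day of month
#   Spark "E"    ~ strftime "%a"  -> abbreviated weekday name
print(example_ts.strftime("%Y"))  # 2021
print(example_ts.strftime("%m"))  # 12
print(example_ts.strftime("%d"))  # 23
print(example_ts.strftime("%a"))  # Thu
```

Each call returns a string holding just the requested field, which mirrors how date_format() yields a string column containing only the extracted part.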

Stepwise implementation of partitioning a timestamp column in data frames using PySpark:

Step 1: First of all, import the required libraries, i.e., SparkSession and functions. The SparkSession library is used to create the session, while the functions module gives access to all the built-in functions available for PySpark data frames.

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

Step 2: Now, create a Spark session using the getOrCreate() function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, either create the data frame using the createDataFrame() function or read the CSV file using the read.csv() function.

data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)

or 

data_frame = spark_session.createDataFrame([(column_data_1), (column_data_2), (column_data_3)],
                                           ['column_name_1', 'column_name_2', 'column_name_3'])

Step 4: Finally, partition the timestamp column by calling the date_format() function with the column that needs to be partitioned and the desired format as arguments. You can also use the alias() function to give a descriptive name to the partitioned column.

df.select(date_format(df.column_name,"#format").alias("partitioned_column_name")).show()
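One subtlety worth keeping in mind when choosing the format string: Spark's pattern letters are case-sensitive, and 'hh' is the 12-hour clock while 'HH' is the 24-hour clock. The same distinction exists in Python's standard strftime ('%I' versus '%H'), shown here only as an illustrative analogy on a hypothetical afternoon timestamp:

```python
from datetime import datetime

# A hypothetical afternoon timestamp, chosen only for illustration
ts = datetime(2021, 12, 23, 17, 5, 9)

# Spark "hh" (12-hour clock) corresponds roughly to strftime "%I";
# Spark "HH" (24-hour clock) corresponds roughly to strftime "%H".
print(ts.strftime("%I"))  # 05  (12-hour clock, like Spark "hh")
print(ts.strftime("%H"))  # 17  (24-hour clock, like Spark "HH")
```

If your timestamps span both morning and evening, the 24-hour pattern avoids ambiguity in the extracted hour.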

Example 1:

In this example, we have created a data frame with three columns 'Employee_Id', 'Login_Time', and 'Logout_Time' as given below. Then, we partition the 'Login_Time' and 'Logout_Time' columns using the date_format() function to get the login hour, logout hour, login minute, logout minute, login second, and logout second respectively.


Python3




# Import the SparkSession and functions library
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Create the data frame which needs to be partitioned
df = spark_session.createDataFrame(
    [(1, '2021-12-23 9:57:58.589151',
      '2021-12-23 5:17:58.380113'),
     (2, '2021-12-23 9:48:18.583154',
      '2021-12-23 5:48:20.485123'),
     (3, '2021-12-23 10:12:28.289350',
      '2021-12-23 5:30:18.282143')],
    ['Employee_Id', 'Login_Time', 'Logout_Time'])
  
# Partition the Login_Time and Logout_Time columns into
# hour ("hh" is the 12-hour clock), minute and second
df.select(df.Employee_Id,
          date_format(df.Login_Time,
                      "hh").alias("Login Time Hour"),
          date_format(df.Logout_Time,
                      "hh").alias("Logout Time Hour"),
          date_format(df.Login_Time,
                      "mm").alias("Login Time Minute"),
          date_format(df.Logout_Time,
                      "mm").alias("Logout Time Minute"),
          date_format(df.Login_Time,
                      "ss").alias("Login Time Second"),
          date_format(df.Logout_Time,
                      "ss").alias("Logout Time Second")
          ).show()


Output:


Example 2:

In this example, we have created a data frame with two columns 'birth_date' and 'Child_Name' as given below. Then, we partition the 'birth_date' column using the date_format() function to get the year, month, date, formatted birth date, birth time, and birth day respectively.


Python3




# Import the SparkSession and functions library
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Create the data frame which needs to be partitioned
df = spark_session.createDataFrame(
    [('2020-12-09', 'Akash'),
     ('2021-07-15 05:17:58.589151', 'Ishita'),
     ('2022-01-29 12:47:34.545151', 'Vinayak')],
    ['birth_date', 'Child_Name'])
  
# Partition of birth_date column with year, month,
# date, birth date, birth time and birth day
df.select(df.Child_Name,
          date_format(df.birth_date,
                      "yyyy").alias("Year"),
          date_format(df.birth_date,
                      "MMMM").alias("Month"),
          date_format(df.birth_date,
                      "dd").alias("Date"),
          date_format(df.birth_date,
                      "MM/dd/yyyy").alias("Birth Date"),
          date_format(df.birth_date,
                      "hh:mm:ss").alias("Birth Time"),
          date_format(df.birth_date,
                      "E").alias("Birth Day")
          ).show()


Output:

