How to read hive partitioned table via pyspark

Last Updated : 16 Oct, 2023

Hive is built on Apache Hadoop and employments a high-level association and inquiry dialect called HiveQL (comparative to SQL) to recover large-scale information put away in disseminated capacity frameworks such as the Hadoop Dispersed Record Framework “HDFS”.

In this article, we will learn how to read a Hive Partitioned Table using Pyspark.

Reading Hive Partitioned Table using Pyspark

A Hive table could be a logical representation of organized information put away within the Apache Hive information distribution center framework. Before reading the hive-partitioned table using Pyspark, we need to have a hive-partitioned table.

Let us see the process of creating and reading a Hive Partitioned Table using Pyspark in Python.

Create a Hive Partitioned Table

First, let us create a hive-partitioned table.

Step 1: Import modules

Import the necessary modules that are required to create a hive table in Pyspark.

from pyspark.sql import SparkSession

Step 2: Create a SparkSession

Create a SparkSession with Hive support enabled using the enableHiveSupport() method.

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

Step 3: Create a database (if it doesn’t already exist)

Define the sample data as a Python list of tuples. Create a DataFrame using the sample data and specify the column names.

data = [("John", 25, "Male"), ("Alice", 30, "Female"),]
df = spark.createDataFrame(data, ["Name", "Age", "Gender"])

Step 4: Create a partitioned table and save it

Now, create a partitioned hive table of the DataFrame using ‘write.partitionBy()‘ function which takes the column name as the parameter based on which we want to partition the hive table. Then save the table using ‘saveAsTable()‘ function that takes the table name as the parameter.

table_name = "tablename"
df.write.partitionBy("column_name").saveAsTable(table_name)

Step 5: Display the Partitioned Hive Table

df.show()

Example:

In this example, we start by importing the SparkSession class from the pyspark.sql module. We then created the SparkSession and define a sample DataFrame ‘df' with a larger dataset. It contains three columns: “Name”, “Age”, and “Gender”.

Python3

# pyspark code to create a hive parititon table
# importing modules
from pyspark.sql import SparkSession
 
# Create a SparkSession with Hive support enabled
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
 
# Create a DataFrame with sample data
data = [
    ("John", 25, "Male"),
    ("Alice", 30, "Female"),
    ("Bob", 27, "Male"),
    ("Emma", 28, "Female"),
    ("David", 32, "Male"),
    ("Sophia", 29, "Female"),
    ("James", 31, "Male"),
    ("Olivia", 26, "Female"),
    ("Ethan", 33, "Male"),
    ("Isabella", 27, "Female")
]
df = spark.createDataFrame(data, ["Name", "Age", "Gender"])
 
table_name = "my_partitioned"
 
# Create a partitioned Hive table
df.write.partitionBy("Gender").saveAsTable(table_name)
 
# display the table
df.show()

Output:

Partitioned Hive Table

Show the Hive Table Partitions

We can use the ‘spark.sql()' function to execute a SQL query. It takes the SQL query as a parameter value. The query is as follows:

"SHOW PARTITIONS table_name"

which retrieves the list of partitions for the Hive table.

Python3

# pyspark code to show the partitions
# Show the partitions in the Hive table
spark.sql("SHOW PARTITIONS {}".format(table_name)).show()

Output:

Hive Table Partitions

Read Hive Partitions from the Table

To read a hive partitioned table, we will use the spark.sql() function to execute a SQL query. The query is as follows:

"SELECT * FROM tablename WHERE condition"

which selects all columns (*) from the Hive table where the condition is true for that particular row. The result is stored in the specific_partitions_df DataFrame.

Example: Reading single Partition of the hive table

In this example, we will read a single hive partition from the table. We will display the partitioned table where the gender is Male.

Python3

# pyspark code to show specific partitions
# Read specific partitions from the Hive table
specific_partitions_df = spark.sql(
    "SELECT * FROM {} WHERE Gender = 'Male'".format(table_name))
 
# Display the contents of the dataframe
specific_partitions_df.show()

Output:

Reading Specific partition of Hive Table

Example: Reading multiple partitions of the hive table.

In this example, we will read multiple hive partitions from the table. We will display the partitioned table where the gender is Male as well as Female.

Python3

# pyspark code to display multiple partitions from the table.
# in this case this will be the all data from hive table
# Read multiple partitions from the Hive table
multiple_partitions_df = spark.sql(
    "SELECT * FROM {} WHERE Gender IN ('Male', 'Female')".format(table_name))
 
# Displaying contents of the hive table
multiple_partitions_df.show()

Output:

Reading multiple partition of hive table

Suggest improvement

Show partitions on a Pyspark RDD

Share your thoughts in the comments

How to read hive partitioned table via pyspark

Reading Hive Partitioned Table using Pyspark

Create a Hive Partitioned Table

Python3

Show the Hive Table Partitions

Python3

Read Hive Partitions from the Table

Python3

Python3

Please Login to comment...

Similar Reads

What kind of Experience do you want to share?