Architecture and Working of Hive
Prerequisite – Introduction to Hadoop, Apache Hive
The major components of Hive and their interaction with Hadoop are shown in the figure below, and each component is described in turn:
- User Interface (UI) –
As the name suggests, the user interface sits between the user and Hive. It lets the user submit queries and other operations to the system. The supported interfaces are the Hive web UI, the Hive command line, and Hive HDInsight (on Windows Server).
- Hive Server – It is also referred to as the Apache Thrift Server. It accepts requests from different clients and passes them to the Hive driver.
- Driver –
The driver receives the queries that the user submits through the interface. It implements the concept of session handles and provides execute and fetch APIs modelled on the JDBC/ODBC interfaces.
- Compiler –
The compiler parses the query and performs semantic analysis on the different query blocks and query expressions. It then generates an execution plan with the help of the table and partition metadata looked up from the metastore.
- Metastore –
The metastore stores all the structure information of the tables and partitions in the warehouse, including column and column-type information, the serializers and deserializers needed to read and write data, and the corresponding HDFS files where the data is stored. Hive uses a database server to hold this schema or metadata: databases, tables, the columns in a table, their data types, and the HDFS mapping.
- Execution Engine –
The execution engine executes the plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between the stages of the plan and executes each stage on the appropriate system component.
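To make the "DAG of stages" idea concrete, here is a minimal Python sketch of how an engine might run stages only after the stages they depend on have finished. It is an illustration, not Hive's actual implementation: the stage names and the `run_plan` helper are hypothetical, and a topological sort from the standard library stands in for the real scheduling logic.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}


def run_plan(plan):
    """Visit every stage only after all of its dependencies have been visited."""
    executed = []
    for stage in TopologicalSorter(plan).static_order():
        # A real engine would launch a job for the stage here.
        executed.append(stage)
    return executed


order = run_plan(plan)
print(order)
```

The two scan stages have no dependencies, so they may appear in either order, but `join` always comes after both of them and `write_result` always comes last.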
Diagram – Architecture of Hive, which is built on top of Hadoop
Along with the architecture, the diagram above shows the job execution flow in Hive with Hadoop, step by step:
- Step-1: Execute Query –
A Hive interface such as the command line or the web UI sends the query to the driver for execution. To do this, the UI calls the driver's execute interface, for example through ODBC or JDBC.
- Step-2: Get Plan –
The driver creates a session handle for the query and passes the query to the compiler, which generates an execution plan.
- Step-3: Get Metadata –
The compiler sends a metadata request to the metastore, which keeps its data in a database, to obtain the metadata it needs.
- Step-4: Send Metadata –
The metastore sends the requested metadata back to the compiler in response.
- Step-5: Send Plan –
The compiler sends the driver the execution plan it has generated for the query.
- Step-6: Execute Plan –
The driver sends the execution plan to the execution engine. This step includes the following sub-steps:
  - Execute Job
  - Job Done
  - DFS operation (Metadata Operation)
- Step-7: Fetch Results –
The user interface (UI) fetches the results from the driver.
- Step-8: Send Results –
The execution engine retrieves the results from the data nodes and sends them to the driver, which returns them to the user interface (UI).
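The steps above can be sketched as a small Python simulation that records which component is called at each step. This is a toy model under stated assumptions, not Hive's API: every class, method, table name, and path below is hypothetical, the query "parser" is deliberately naive, and the execution engine returns a canned row instead of running jobs on a cluster.

```python
class Metastore:
    """Toy metastore: table name -> schema and storage location (made-up values)."""

    def __init__(self):
        self.tables = {
            "sales": {
                "columns": {"id": "int", "amount": "double"},
                "location": "/warehouse/sales",
            },
        }

    def get_metadata(self, table):
        return self.tables[table]


class Compiler:
    def __init__(self, metastore):
        self.metastore = metastore

    def compile(self, query, trace):
        # Naive parse for the sketch: take the word after FROM as the table name.
        table = query.lower().split(" from ")[1].split()[0]
        trace.append("3: get metadata")
        meta = self.metastore.get_metadata(table)
        trace.append("4: send metadata")
        return {"table": table, "location": meta["location"]}


class ExecutionEngine:
    def execute(self, plan, trace):
        trace.append("6: execute plan")
        # A real engine would run jobs on the cluster; this returns a canned row.
        return [("row from " + plan["location"],)]


class Driver:
    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine
        self.next_handle = 0

    def execute(self, query, trace):
        trace.append("1: execute query")
        self.next_handle += 1  # session handle created for this query
        trace.append("2: get plan")
        plan = self.compiler.compile(query, trace)
        trace.append("5: send plan")
        rows = self.engine.execute(plan, trace)
        trace.append("7: fetch results")
        trace.append("8: send results")
        return self.next_handle, rows


trace = []
driver = Driver(Compiler(Metastore()), ExecutionEngine())
handle, rows = driver.execute("SELECT id FROM sales", trace)
print(handle, rows)
print("\n".join(trace))
```

The point of the sketch is the call order: the driver is the only component the UI talks to, the compiler is the only one that talks to the metastore, and the results travel back through the driver.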
Advantages of Hive Architecture:
Scalability: Hive is a distributed system that can easily scale to handle large volumes of data by adding more nodes to the cluster.
Data Accessibility: Hive allows users to access data stored in Hadoop without complex programming skills. Queries are written in HiveQL, a SQL-like language based on SQL syntax.
Data Integration: Hive integrates easily with other tools and systems in the Hadoop ecosystem such as Pig, HBase, and MapReduce.
Flexibility: Hive can handle both structured and unstructured data, and supports various data formats including CSV, JSON, and Parquet.
Security: Hive provides security features such as authentication, authorization, and encryption to ensure data privacy.
Disadvantages of Hive Architecture:
High Latency: Hive queries typically take longer than queries on traditional databases because of the overhead of running them in a distributed system.
Limited Real-time Processing: Hive is not ideal for real-time data processing as it is designed for batch processing.
Complexity: Hive is complex to set up and requires a high level of expertise in Hadoop, SQL, and data warehousing concepts.
Lack of Full SQL Support: HiveQL does not support all SQL operations. Features such as transactions and indexes are limited or depend on the Hive version and table format, which may limit the usefulness of the tool for certain applications.
Debugging Difficulties: Debugging Hive queries can be difficult as the queries are executed across a distributed system, and errors may occur in different nodes.