
HIVE Overview

From the early days of the Internet’s rapid growth, search engine providers and e-commerce companies struggled with steadily growing volumes of data, and social networking sites such as Facebook, Twitter, and Instagram faced the same problem. Today, many organizations understand that the data they collect is a valuable asset for understanding their customers, the impact of their activities in the market, their performance, the effectiveness of their infrastructure, and so on. This is where Hadoop emerged as a rescuer: it provides an efficient way to handle huge datasets using HDFS (Hadoop Distributed File System) and uses MapReduce to split computation tasks into units that can be distributed across a cluster of commodity hardware, providing horizontal scalability.
Some big challenges still had to be resolved: how would someone move an existing data infrastructure to Hadoop when that infrastructure depends on a relational database system and the Structured Query Language (SQL)? And what about the database designers, administrators, and regular users who rely on SQL to pull information out of their data warehouse?
This is where Hive comes into the picture. Hive provides a SQL dialect known as the Hive Query Language (HQL) to retrieve and modify data stored in Hadoop. Apache Hive is an open-source data warehouse system built on top of a Hadoop cluster for querying and analyzing large datasets stored in the Hadoop Distributed File System. HiveQL automatically converts SQL-like queries into MapReduce jobs.
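For example, an analytical query in HiveQL looks almost identical to standard SQL; the table and column names below are hypothetical and serve only to illustrate the syntax:

  -- count visits per page for one day; Hive compiles this into MapReduce jobs
  SELECT page, COUNT(*) AS visits
  FROM web_logs
  WHERE visit_date = '2023-01-01'
  GROUP BY page;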

History of HIVE –

Hive was developed by the Data Infrastructure team at Facebook. At Facebook, Hive’s Hadoop cluster is capable of storing more than 2 petabytes of raw data, and it processes and loads around 15 terabytes of data daily. Later, the Apache Software Foundation took over Hive, developed it further, and made it open source. It is now also used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA).



Features –

Hive offers a declarative, SQL-based language (HiveQL), mainly used for data analysis and creating reports. Hive operates on the server side of a cluster.
Hive provides schema flexibility and evolution along with data summarization, querying of data, and analysis in a much easier manner.
In Hive, we can create two types of tables – partitioned and bucketed – which make it feasible to process data stored in HDFS and also improve performance (see the example after this list).
Hive tables are defined directly in the Hadoop File System(HDFS).
Hive provides JDBC/ODBC drivers.
Hive is fast and scalable, and easy to learn.
Hive has a rule-based optimizer for optimizing plans.
Using Hive we can also execute Ad-hoc queries to analyze data.
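Below is a minimal sketch of a partitioned and bucketed table; the table name, columns, and bucket count are illustrative assumptions, not part of any particular schema:

  -- partitioned by country, bucketed by user_id into 4 buckets (illustrative)
  CREATE TABLE user_events (
    user_id BIGINT,
    action STRING,
    event_ts TIMESTAMP
  )
  PARTITIONED BY (country STRING)
  CLUSTERED BY (user_id) INTO 4 BUCKETS
  STORED AS ORC;

  -- filtering on the partition column lets Hive read only that partition
  SELECT action, COUNT(*) FROM user_events WHERE country = 'IN' GROUP BY action;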

HIVE Architecture –

CLI, UI, and Thrift Server – These provide a user interface through which an external user can interact with Hive by writing queries and instructions and by monitoring the process. The Thrift server allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.



Working –

  1. First, the user submits a query, and the CLI sends that query to the Driver.
  2. The Driver then takes the help of the query compiler to check the syntax.
  3. The compiler then requests metadata by sending a metadata request to the Metastore.
  4. In response to that request, the Metastore sends the metadata back to the compiler.
  5. After checking the requirements, the compiler sends the plan back to the Driver.
  6. The Driver sends the plan to the execution engine (the plan can be inspected with EXPLAIN, as shown after this list).
  7. The execution engine submits the job to the JobTracker, which assigns it to TaskTrackers; here the query runs as a MapReduce job. In the meantime, the execution engine performs metadata operations with the Metastore.
  8. The execution engine then fetches the results from the Data Nodes and sends them to the Driver.
  9. Finally, the Driver sends the results to the Hive interface.
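The plan produced by the compiler and handed to the execution engine can be inspected directly from HiveQL; the query below reuses the hypothetical web_logs table from the earlier example:

  -- EXPLAIN prints the query plan instead of running the job
  EXPLAIN
  SELECT page, COUNT(*) AS visits FROM web_logs GROUP BY page;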

 

HIVE Metastore –

Hive Metastore is the central repository for metadata. It stores metadata for Hive tables (like their schema and location) and partitions in a relational database. It provides client access to this information by using the metastore service API.
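Because table definitions live in the Metastore, they can be inspected from HiveQL without touching the data files; the statements below assume the hypothetical partitioned user_events table from the earlier example:

  -- both statements are answered from the Metastore, not from HDFS data
  SHOW PARTITIONS user_events;
  DESCRIBE FORMATTED user_events;   -- schema, location, storage format, owner, etc.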
APIs:

  1. HCatalog CLI (command-based) – A query-based API, which means it only permits the submission and execution of HQL.
  2. Metastore (Java) – A Thrift-based API, implemented by the IMetaStoreClient interface in Java, which decouples the metastore storage layer from Hive internals.
  3. Streaming Data Ingest (Java) – Used to write continuous streaming data into transactional tables using Hive’s ACID properties (see the sketch after this list).
  4. Streaming Mutation (Java) – Used to apply mutating operations such as INSERT, UPDATE, and DELETE to transactional tables, again relying on Hive’s ACID properties.
  5. Hive-JDBC (JDBC) – Used to support JDBC functionality in Hive.
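Both streaming APIs target ACID transactional tables. A minimal HiveQL sketch of such a table follows; the table name and columns are illustrative, and a suitably configured Hive installation is assumed (transaction manager enabled, ORC storage, and on older Hive versions a bucketed table):

  -- transactional table that ACID operations and streaming clients can write to
  CREATE TABLE user_events_txn (
    user_id BIGINT,
    action STRING
  )
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true');

  -- row-level changes are only allowed on transactional tables
  UPDATE user_events_txn SET action = 'login' WHERE user_id = 42;
  DELETE FROM user_events_txn WHERE user_id = 7;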

 
Limitations –
Apache Hive also has some limitations:

  1. Read-only views are allowed, but materialized views are not (see the example after this list).
  2. It does not support triggers.
  3. Apache Hive queries have very high latency.
  4. There is no difference between NULL and null values.
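A view in Hive is a purely logical, read-only definition; the sketch below uses the hypothetical user_events table from the earlier example:

  -- logical view; it stores no data and cannot be written to
  CREATE VIEW active_users AS
  SELECT user_id, COUNT(*) AS events
  FROM user_events
  GROUP BY user_id
  HAVING COUNT(*) > 10;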

 
How is HIVE different from an RDBMS?

  1. An RDBMS supports schema on write, whereas Hive provides schema on read (see the sketch after this list).
  2. Hive follows a write-once, read-many model, whereas in an RDBMS data can be written and updated as many times as needed.
  3. Hive can handle very large datasets, whereas an RDBMS typically cannot handle data beyond about 10 TB.
  4. Hive is highly scalable, but scaling an RDBMS costs a lot.
  5. Hive has a bucketing feature, which an RDBMS does not.
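Schema on read means the table definition is simply applied to files that already sit in HDFS when a query runs; the path, delimiter, and column names below are illustrative assumptions:

  -- external table laid over existing HDFS files; Hive does not move or validate the data on load
  CREATE EXTERNAL TABLE raw_clicks (
    user_id BIGINT,
    url STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/raw/clicks';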