Hive can be used to manage structured data on top of Hadoop. The data is stored in the form of tables inside a database. Hive allows users to create both internal and external tables to manage and store data in a database. In this article, we discuss the difference between Hive internal and external tables with a practical walkthrough. Both internal and external tables have their own use cases and can be chosen as per the requirement. For example, external tables are preferred over internal tables when the data needs to be shared with other tools on Hadoop such as Apache Pig.
In general, whenever we create a table inside a Hive database, by default it is an internal table, also called a managed table. Internal tables are called managed because Hive itself manages both the metadata and the data inside the table. All internal tables created in Hive are stored by default under the /user/hive/warehouse directory on HDFS. We can check or override this default storage location through the hive.metastore.warehouse.dir property. When internal (managed) tables are dropped, all their metadata and table data are deleted permanently from HDFS and cannot be recovered. Managed tables are not suitable when the data needs to live outside Hive and also be used by some other Hadoop utility on HDFS (Hadoop Distributed File System); that is where external tables come into the picture.
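For instance, the warehouse location can be inspected directly from the Hive CLI; this is a minimal sketch assuming the default configuration, where the property resolves to /user/hive/warehouse:

```sql
-- Print the current warehouse location (defaults to /user/hive/warehouse)
SET hive.metastore.warehouse.dir;
```

The same property can be overridden in hive-site.xml to relocate the warehouse.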
Key points to remember about Internal tables:
- Hive moves the data file we load into the table to /database-name/table-name inside the warehouse directory
- Internal tables support the TRUNCATE command
- Internal tables can have ACID support
- Internal tables support query result caching, i.e. Hive can store the result of an already executed query for reuse by subsequent queries
- Both the metadata and the table data are removed as soon as the table is dropped
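To illustrate the ACID point, a managed table can be declared transactional. This is a sketch that assumes Hive 3.x with ACID enabled on the cluster; transactional tables must use the ORC format, and the table name txn_demo is just for illustration:

```sql
-- Managed ORC table with ACID support (assumes Hive 3.x with ACID enabled)
CREATE TABLE txn_demo (
  id INT,
  name STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level UPDATE and DELETE are only allowed on ACID tables
UPDATE txn_demo SET name = 'updated' WHERE id = 1;
```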
Let’s perform a small demo to understand the concept of Internal tables in Hive.
To perform the operations below, make sure Hive is running. The steps to launch Hive on your local system are as follows.
Step 1: Start all your Hadoop Daemon
start-dfs.sh   # starts the NameNode, DataNode and Secondary NameNode
start-yarn.sh  # starts the NodeManager and ResourceManager
jps            # check the running daemons
Step 2: Launch hive from terminal
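Assuming Hive is on your PATH, the CLI can be started with a single command (on newer installations, Beeline is the recommended client instead):

```shell
# Launch the Hive CLI
hive
```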
Now, we are all set to perform the quick demo.
Step 1: Create a table named test (the table will be created in Hive's default database unless another database is specified).
CREATE TABLE test( id INT, Name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
A table created in Hive is by default an internal, or managed, table. With the above command, the internal table is successfully created.
Step 2: Load data to this table test with the below command.
LOAD DATA LOCAL INPATH '/home/dikshant/Desktop/data.csv' INTO TABLE test;
The data has been successfully loaded. Whenever we load a file into a Hive table, Hive moves that file to /database-name/table-name inside the warehouse directory. Since the table test was created in the default database, the database name does not appear in the path.
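You can confirm where the file landed by listing the table's warehouse directory; this assumes the default warehouse location and the table test in the default database:

```shell
# The loaded file now lives under the warehouse directory
# (no database folder appears because test is in the default database)
hdfs dfs -ls /user/hive/warehouse/test
```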
Step 3: Check whether the data was loaded into the table
select * from test;
Step 4: Describe the table to see whether it is internal or external
describe extended test;
The output shows all the metadata of our table; the table type (MANAGED_TABLE for an internal table) appears in the detailed table information.
Step 5: Use TRUNCATE to delete the test table's data, which is supported because the table is internal.
truncate table test;
As soon as the test table is truncated, all of its data is removed from the warehouse, since Hive has ownership of internal tables. The table structure itself remains in place, which we can confirm by describing or querying the now-empty table.
Step 6: Drop the table test(Now the metadata will also be deleted)
drop table test;
External tables are an excellent way to manage data in Hive, since Hive does not take ownership of the data stored inside them. If the user drops an external table, only the table's metadata is removed and the data remains safe. The EXTERNAL keyword in the CREATE TABLE statement is used to create external tables in Hive. We also specify the HDFS location from which the table reads its data. External tables are required whenever shareable data resides on HDFS and both Hive and other Hadoop components, such as Pig, need to use the same data. Hive manages the metadata for external tables, but the data itself comes from another location on HDFS.
Key points to remember about External tables
- Hive does not move the data into its warehouse directory
- External tables do not support the TRUNCATE command
- No support for the ACID transaction properties
- No support for query result caching
- Only the metadata is removed when an external table is dropped
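Worth noting: an existing managed table can be flipped to external (and back) by changing the EXTERNAL table property, without moving any data. A minimal sketch, using the table name test from the demo:

```sql
-- Convert a managed table to external; the data stays where it is
ALTER TABLE test SET TBLPROPERTIES ('EXTERNAL'='TRUE');

-- Convert it back to a managed table
ALTER TABLE test SET TBLPROPERTIES ('EXTERNAL'='FALSE');
```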
Let’s perform a small demo to understand the concept of External tables in Hive.
Step 1: Create a directory with the name /TableData in HDFS
hdfs dfs -mkdir /TableData
Step 2: Copy the data file you want to use with the Hive external table into this directory (in my case, data.csv)
hdfs dfs -put /home/dikshant/Desktop/data.csv /TableData
Step 3: Create an external table test with the EXTERNAL keyword as shown below (the table will be created in Hive's default database unless another database is specified)
CREATE EXTERNAL TABLE test( id INT, Name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/TableData';
The table test created with the EXTERNAL keyword now reads its data from the directory specified in the LOCATION clause.
Step 4: Let’s perform the select query on the external table test we have created
select * from test;
The query returns rows even though we never loaded anything into this table. This is possible because external tables are designed to read data from the specified location; in our case the location is /TableData, which contains data.csv, so the Hive external table automatically picks up its data from there.
Step 5: Since the data lives outside our warehouse and Hive does not have ownership of it, TRUNCATE is not possible on external tables; the following statement fails with an error.
truncate table test;
Step 6: If we drop this external table, only the table metadata is removed; the actual data stored at /TableData on HDFS stays untouched.
drop table test;
We can verify that /TableData is still intact and has not been deleted.
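Listing the directory from the shell is one way to check; this assumes the /TableData directory created earlier in the demo:

```shell
# data.csv is still present even though the external table was dropped
hdfs dfs -ls /TableData
```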
This is all about Internal and External tables in Hive.