Characteristics of HDFS

Last Updated : 24 Apr, 2023

HDFS is one of the major components of Hadoop that provide an efficient way for data storage in a Hadoop cluster. But before understanding the features of HDFS, let us know what is a file system and a distributed file system. We can say a file system is a data storage architecture that an operating system uses to manage and maintain the files on its storage space. An example of the windows file system is NTFS(New Technology File System) and FAT32(File Allocation Table 32). FAT32 is used in some older versions of windows but can be utilized on all versions of Windows XP. Similarly like windows, we have ext3, ext4 kind of file system for Linux OS.

Similar, HDFS is a distributed file system. A DFS(Distributed File System) is a client-server based application. It allows clients to access and process the data that is stored on the servers. Whenever the client requests a file from the server the server sends a copy of the file to the client, which is then cached on the client’s computer while the data is being processed and then the processed data is returned to the server.

Let’s know about HDFS on Hadoop. HDFS is a default file system for Hadoop where HDFS stands for Hadoop Distributed File System. It is designed to store a massive volume of data and provide access to this data to so many clients. So the Hadoop application utilizes HDFS as a primary storage system. HDFS is similar to the google file system that well organized the file and stores the data in a distributed manner on various nodes or machines. Now, let us discuss the Top-notch features of HDFS that makes it more favorable.

1. Run-on low-cost system i.e. commodity hardware

Hadoop Distributed File System is very much similar to the existing Distributed File System but it differs in several aspects like commodity hardware. The Hadoop HDFS does not require specialized hardware to store and process very large size data, rather it is designed to work on low-cost clusters of commodity hardware. Where clusters mean a group of computers that are connected Which are cheap and affordable.

2. Provide High Fault Tolerance

HDFS provides high fault tolerance, Fault tolerance is achieved when the system functions properly without any data loss even if some hardware components of the system has failed. In a cluster when a single node fails it crashes the entire system. The primary duty of fault tolerance is to remove such failed nodes which disturbs the entire normal functioning of the system. By default, in HDFS every data block is replicated in 3 data nodes. If a data node goes down the client can easily fetch the data from the other 2 data nodes where data is replicated hence it prevents the entire system to go down and Fault tolerance is achieved in a Hadoop cluster. The HDFS is flexible enough to add and remove the data nodes with fewer efforts. There are 3 ways with which HDFS can achieve fault tolerance i.e. Data replication, Heartbeat Messages, and checkpoints, and recovery.

3. Large Data Set

In the case of HDFS large data set means the data that is in hundreds of megabytes, gigabytes, terabytes, or sometimes even in petabytes in size. It is preferable to use HDFS for files of very large size instead of using so many small files because metadata of a large number of small files consumes a very large space in the memory than that of the less number of entries for large files in name node.

4. High Throughput

HDFS is designed to be a High Throughput batch processing system rather than providing low latency interactive uses. HDFS always implements WORM pattern i.e. Write Once Read Many. The data is immutable means once the data is written it can not be changed. Due to which data is the same across the network. Thus it can process large data in a given amount of time and hence provides High Throughput.

5. Data Locality

HDFS allows us to store and process massive size data on the cluster made up of commodity hardware. Since the data is significantly large so the HDFS moves the computation process i.e. Map-Reduce program towards the data instead of pulling the data out for computation. These minimize network congestion and increase the overall throughput of the system.

6. Scalability

As HDFS stores the large size data over multiple nodes, so when the requirement of data storing is increased or decreased the number of nodes can be scaled up or scaled down in a cluster. The vertical and Horizontal scalability is the 2 different mechanisms available to provide scalability in the cluster. Vertical scalability means adding the resource like Disk space, RAM on the existing node of the cluster. On another hand in Horizontal scaling, we increase the number of nodes in the cluster and it is more preferable since we can have hundreds of nodes in a cluster.

7. Data Compression
HDFS provides built-in support for data compression. Data compression reduces the amount of storage required for large data sets, which in turn reduces the storage cost. HDFS uses the zlib compression algorithm by default and supports other compression algorithms like Snappy, LZO, and Gzip. The compression algorithm can be specified at the time of creating a file or directory.

8. Security
HDFS provides security for the stored data through features like authentication, authorization, and encryption. Authentication is the process of verifying the identity of the user who is accessing the system, while authorization is the process of determining whether a user has the required permissions to access the data. Encryption is the process of encoding the data in such a way that it cannot be read by unauthorized users. HDFS provides Kerberos authentication and Access Control Lists (ACLs) for authorization.

9. Easy Integration
HDFS is designed to be easily integrated with other Hadoop ecosystem components like Apache Spark, Apache Hive, and Apache Pig. This enables developers to perform various data processing tasks using these tools without having to worry about data storage and retrieval, as HDFS takes care of it. Additionally, HDFS can be accessed using a variety of programming languages like Java, Python, and Scala.

10. Open Source
HDFS is an open-source project developed by Apache Software Foundation. This means that anyone can contribute to the development of the software and make improvements. Open source also means that there is a large community of developers who are constantly working to improve the software and fix bugs. This ensures that HDFS is constantly evolving and improving.

Suggest improvement

HDFS Commands

Share your thoughts in the comments

Characteristics of HDFS

Please Login to comment...

Similar Reads

What kind of Experience do you want to share?