HDFS is one of the major components of Hadoop and provides an efficient way to store data in a Hadoop cluster. Before looking at the features of HDFS, let us see what a file system and a distributed file system are. A file system is the data storage architecture that an operating system uses to manage and maintain the files on its storage space. Examples of Windows file systems are NTFS (New Technology File System) and FAT32 (File Allocation Table 32). FAT32 was used in some older versions of Windows but is still supported by later ones, including Windows XP. Similarly, Linux has file systems such as ext3 and ext4.
Likewise, HDFS is a distributed file system. A DFS (Distributed File System) is a client-server application that allows clients to access and process data stored on servers. In a typical DFS, when a client requests a file, the server sends a copy of the file to the client. The copy is cached on the client's computer while the data is processed, and the processed data is then returned to the server.
Now let us look at HDFS in Hadoop. HDFS stands for Hadoop Distributed File System and is the default file system for Hadoop. It is designed to store massive volumes of data and to provide access to that data to a large number of clients, so Hadoop applications use HDFS as their primary storage system. HDFS is similar to the Google File System: it organizes files and stores their data in a distributed manner across many nodes or machines. Let us now discuss the top features of HDFS that make it attractive.
1. Runs on low-cost systems, i.e. commodity hardware
HDFS is similar to existing distributed file systems but differs from them in several respects, one of which is its use of commodity hardware. HDFS does not require specialized hardware to store and process very large data. Rather, it is designed to work on low-cost clusters of commodity hardware, where a cluster is a group of connected computers that are cheap and affordable.
2. Provides High Fault Tolerance
HDFS provides high fault tolerance. A system is fault tolerant when it continues to function properly, without any data loss, even if some of its hardware components have failed. In a cluster without fault tolerance, the failure of a single node can bring down the entire system. Fault tolerance ensures that such failed nodes do not disturb the normal functioning of the system. By default, HDFS replicates every data block on 3 data nodes. If one data node goes down, the client can still fetch the data from the other 2 data nodes that hold a replica, so the system as a whole stays up and fault tolerance is achieved in the Hadoop cluster. HDFS is also flexible enough that data nodes can be added and removed with little effort. HDFS achieves fault tolerance in 3 ways: data replication, heartbeat messages, and checkpoints and recovery.
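The effect of 3-way replication can be illustrated with a small toy model. This is only a sketch, not HDFS code: the node names, block names, and the round-robin placement in `place_blocks` are assumptions made for illustration, and real HDFS placement is more involved.

```python
REPLICATION_FACTOR = 3  # the HDFS default described above


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a simplification chosen for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


# Hypothetical cluster of four data nodes holding four blocks.
nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3"]
placement = place_blocks(blocks, nodes)

# Losing one or two nodes leaves every block readable.
print(readable_blocks(placement, ["dn1"]) == set(blocks))                # True
print(readable_blocks(placement, ["dn1", "dn2"]) == set(blocks))         # True
# Losing three nodes removes every replica of blk_0.
print(readable_blocks(placement, ["dn1", "dn2", "dn3"]) == set(blocks))  # False
```

With 3 replicas per block, data only becomes unavailable when all 3 nodes holding a given block fail at the same time.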
3. Large Data Set
In the context of HDFS, a large data set means data ranging from hundreds of megabytes through gigabytes and terabytes, and sometimes even petabytes. It is preferable to use HDFS for a small number of very large files rather than a large number of small files, because the name node keeps the metadata for every file in memory, and the metadata for many small files consumes far more memory than the fewer entries needed for large files.
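A rough back-of-the-envelope calculation shows why many small files are costly for the name node. The figures below are assumptions, not values from this article: a block size of 128 MB and a commonly quoted rule of thumb of roughly 150 bytes of name node memory per file or block object. Actual numbers depend on the Hadoop version and configuration.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size of 128 MB

MB = 1024 ** 2
GB = 1024 ** 3


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count metadata objects: one per file plus one per block."""
    blocks = sum(max(1, math.ceil(size / block_size)) for size in file_sizes)
    return len(file_sizes) + blocks


def namenode_bytes(file_sizes):
    """Estimate name node memory for the given files under the assumptions above."""
    return namenode_objects(file_sizes) * BYTES_PER_OBJECT


one_large = [1 * GB]          # a single 1 GB file
many_small = [1 * MB] * 1024  # the same 1 GB stored as 1 MB files

print(namenode_objects(one_large))   # 9 (1 file + 8 blocks)
print(namenode_objects(many_small))  # 2048 (1024 files + 1024 blocks)
print(namenode_bytes(one_large))     # 1350
print(namenode_bytes(many_small))    # 307200
```

Under these assumptions, the same 1 GB of data needs more than 200 times as much name node memory when it is stored as 1 MB files.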
4. High Throughput
HDFS is designed as a high-throughput batch processing system rather than for low-latency interactive use. It follows the WORM pattern, i.e. Write Once Read Many: once data is written, it is treated as immutable and is not changed in place. As a result, every reader sees the same data across the network, and no coordination of concurrent updates is needed. HDFS can therefore process a large amount of data in a given amount of time, which gives it high throughput.
5. Data Locality
HDFS allows us to store and process massive data on a cluster made up of commodity hardware. Because the data is so large, HDFS moves the computation, i.e. the MapReduce program, towards the data instead of pulling the data out for computation. This minimizes network congestion and increases the overall throughput of the system.
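The idea of moving computation to the data can be sketched as a simple node-selection rule. This is a toy illustration only: the function `pick_node`, the node names, and the fallback rule are hypothetical and do not reflect how Hadoop's scheduler is actually implemented.

```python
def pick_node(block, block_locations, free_nodes):
    """Choose where to run a task that reads `block`.

    Prefers a free node that already stores a replica of the block.
    Falls back to any free node, in which case the block has to be
    read over the network. Returns (node, is_local).
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    for node in block_locations.get(block, []):
        if node in free_nodes:
            return node, True
    # No replica holder is free: pick a free node deterministically.
    return sorted(free_nodes)[0], False


# Hypothetical replica locations for two blocks.
block_locations = {
    "blk_0": ["dn1", "dn2", "dn3"],
    "blk_1": ["dn2", "dn3", "dn4"],
}

print(pick_node("blk_0", block_locations, {"dn2", "dn4"}))  # ('dn2', True)
print(pick_node("blk_1", block_locations, {"dn1"}))         # ('dn1', False)
```

In the first call the task runs on a node that already holds the block, so no block data crosses the network. In the second call no replica holder is free, so the read is remote.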
6. Scalability
Because HDFS stores large data across multiple nodes, the number of nodes in a cluster can be scaled up or down as storage requirements increase or decrease. Vertical and horizontal scaling are the 2 mechanisms available for providing scalability in the cluster. Vertical scaling means adding resources such as disk space or RAM to the existing nodes of the cluster. Horizontal scaling, on the other hand, means increasing the number of nodes in the cluster. It is generally preferred, since a cluster can have hundreds of nodes.
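The two scaling options can be compared with a simple capacity estimate. This is a simplified sketch: the cluster sizes and disk sizes are made-up example values, and the formula ignores space reserved for the operating system, temporary data, and other overhead.

```python
def usable_capacity_tb(nodes, disk_tb_per_node, replication=3):
    """Approximate usable storage: raw capacity divided by the replication factor."""
    return nodes * disk_tb_per_node / replication


base = usable_capacity_tb(nodes=10, disk_tb_per_node=12)
vertical = usable_capacity_tb(nodes=10, disk_tb_per_node=24)    # bigger disks per node
horizontal = usable_capacity_tb(nodes=20, disk_tb_per_node=12)  # more nodes

print(base)        # 40.0
print(vertical)    # 80.0
print(horizontal)  # 80.0
```

Both options double the usable capacity in this example, but horizontal scaling also adds processing power and network bandwidth with every new node, while vertical scaling is limited by how much a single machine can hold.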