Most of us are familiar with the term rack. A rack is a physical collection of nodes in a Hadoop cluster, typically around 30 to 40. A large Hadoop cluster consists of many racks. Using this rack information, the Namenode chooses the closest Datanode when serving read and write requests, which improves performance and reduces network traffic. A rack can contain multiple Datanodes that store file blocks and their replicas. With the default replication factor of 3, HDFS automatically places two replicas of a block on two different Datanodes of the same rack. A block can also be spread over more than two racks, because replica placement is configurable and can be changed manually.
Replica placement in a rack-aware cluster follows these policies:
- There should not be more than 1 replica on the same Datanode.
- No more than 2 replicas of a single block are allowed on the same rack.
- The number of racks used to store a block's replicas must be smaller than the number of replicas.
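The three policies above can be sketched as a small check on one block's replica placement. This is only an illustration, not Hadoop's actual placement code: the function name `check_placement`, the `(rack, datanode)` tuple layout, and the rack and Datanode names are all assumptions made for this example.

```python
from collections import Counter


def check_placement(replicas):
    """Return a list of policy violations for one block.

    `replicas` is a list of (rack, datanode) pairs, one per replica.
    The layout is hypothetical and chosen only for this sketch.
    """
    violations = []
    node_counts = Counter(replicas)
    rack_counts = Counter(rack for rack, _ in replicas)

    # Policy 1: at most one replica on the same Datanode.
    for (rack, node), count in node_counts.items():
        if count > 1:
            violations.append(f"{count} replicas on {rack}/{node}")

    # Policy 2: at most two replicas of a block on the same rack.
    for rack, count in rack_counts.items():
        if count > 2:
            violations.append(f"{count} replicas on rack {rack}")

    # Policy 3: fewer racks than replicas. A single-replica block is
    # skipped here, which is an assumption of this sketch.
    if len(replicas) > 1 and len(rack_counts) >= len(replicas):
        violations.append(
            f"{len(rack_counts)} racks used for {len(replicas)} replicas"
        )
    return violations


# One replica on one rack, two replicas on two Datanodes of another rack.
good = [("rack1", "dn1"), ("rack2", "dn5"), ("rack2", "dn6")]
# All three replicas on the same rack.
bad = [("rack1", "dn1"), ("rack1", "dn2"), ("rack1", "dn3")]

print(check_placement(good))
print(check_placement(bad))
```

The first placement passes all three checks, while the second is rejected because one rack holds three replicas of the block.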
Now consider the example cluster in the diagram. Block 1 is stored on the first Datanode of Rack 1, and 2 replicas of Block 1 are stored on Datanodes 5 and 6 of a rack, which gives 3 copies in total. The replicas of the other 2 blocks are likewise distributed across racks in line with the policies above.
Benefits of Implementing Rack Awareness in our Hadoop Cluster:
- Rack awareness policies store replicas in different racks, so the failure of a single rack does not cause data loss.
- Rack awareness makes better use of network bandwidth, because most block transfers stay within a rack instead of crossing racks.
- It also improves the cluster performance and provides high data availability.
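To make the bandwidth benefit concrete, here is a minimal sketch of how a reader could be pointed at the closest replica. It uses the hop-count convention commonly described for Hadoop (0 for the same node, 2 for the same rack, 4 for a different rack in the same data center). The functions `distance` and `closest_replica` and the `(rack, datanode)` tuples are hypothetical names for this example, not Hadoop APIs.

```python
def distance(a, b):
    """Network distance between two (rack, datanode) locations."""
    if a == b:
        return 0  # same Datanode
    if a[0] == b[0]:
        return 2  # same rack, different Datanode
    return 4      # different racks


def closest_replica(reader, replicas):
    """Pick the replica with the smallest distance to the reader.

    Ties are broken by list order, because min() keeps the first
    minimal element it finds.
    """
    return min(replicas, key=lambda location: distance(reader, location))


replicas = [("rack1", "dn1"), ("rack2", "dn5"), ("rack2", "dn6")]
reader = ("rack2", "dn7")

# The reader sits on rack2, so a rack2 replica is preferred over rack1.
print(closest_replica(reader, replicas))
```

Serving the read from a replica on the reader's own rack avoids sending the block across the inter-rack link, which is usually the more contended part of the network.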
HDFS Rack Awareness Example: