HDFS (Hadoop Distributed File System) is a distributed file system that stores data across a network of commodity machines. It follows a streaming data access pattern, which means it is designed for write-once, read-many workloads. Since reading is such a common operation, it is worth knowing how a read actually happens on HDFS. Let's understand how an HDFS data read works.
Reading from HDFS looks simple, but there is more to it than it seems. When a client asks HDFS for a file, it cannot go straight to the DataNodes where the actual data is stored, because it does not yet know which DataNodes hold the file's blocks or where their replicas are. Without this information, the client cannot read the data.
That is why the client first sends its request to the NameNode, which holds all the metadata needed to perform a read. The NameNode responds with information such as the blocks that make up the file and the DataNodes that hold each block and its replicas. With this information, the client can read the data directly from the DataNodes. Because every block is replicated on several DataNodes, the client can fetch each block from any DataNode that holds a copy and switch to another replica if one is unavailable. Once all the blocks have been read, they are combined in order to form the original file.
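To make the final step concrete, here is a minimal local sketch in shell, with no HDFS involved: it cuts a small file into fixed-size pieces with split, joins them back in order with cat, and checks that the result matches the original. The 16-byte piece size, the temporary directory and the file names are assumptions chosen for illustration; real HDFS blocks are far larger (128 MB by default in recent Hadoop versions) and live on different DataNodes.

```shell
# Local analogy only: nothing here talks to HDFS.
workdir=$(mktemp -d)

# Create a small sample file (hypothetical content).
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/original.txt"

# Cut the file into 16-byte "blocks" named block_aa, block_ab, ...
split -b 16 "$workdir/original.txt" "$workdir/block_"
block_count=$(ls "$workdir"/block_* | wc -l)

# Join the blocks back together; the glob expands in sorted order,
# so the pieces are concatenated in their original sequence.
cat "$workdir"/block_* > "$workdir/reassembled.txt"

# Compare the reassembled file with the original, byte for byte.
if cmp -s "$workdir/original.txt" "$workdir/reassembled.txt"; then
  result="identical"
else
  result="different"
fi
echo "blocks: $block_count, result: $result"

rm -rf "$workdir"
```

The order of the pieces matters: joining them in any other sequence would produce a different file, which is why a reader needs the block list from the metadata before it can rebuild the file.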
Let's understand the HDFS read operation with the help of a diagram.
Components we need to know before learning the HDFS read operation:
NameNode: The primary purpose of the NameNode is to manage all the metadata. Data in a Hadoop cluster is stored in the form of blocks, and the metadata records on which DataNode each block of a file is stored. The NameNode also keeps a log of the transactions happening in the cluster, such as who read or wrote data and when.
DataNode: The DataNode is a program that runs on each worker (slave) machine. It stores data in the form of blocks and serves read/write requests from the client.
HDFS Client: The HDFS client is an intermediate component between HDFS and the user. It communicates with the NameNode and the DataNodes and fetches the output that the user requests.
In the image above, we can see that the request first goes to the HDFS client, which is a set of programs. The HDFS client contacts the NameNode because it holds all the metadata about the file we want to read. The NameNode responds by sending this metadata back to the HDFS client. Once the HDFS client knows where to pick up each data block, it asks the FSDataInputStream to point to those blocks on the DataNodes. The FSDataInputStream then does some processing and makes the data available to the client.
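The block locations held by the NameNode can also be inspected from the command line with hdfs fsck, which reports the blocks of a file and the DataNodes holding each replica. The sketch below wraps that command in a small helper; the function name show_block_locations is hypothetical, and the guard only checks that the hdfs command is on the PATH.

```shell
# Hypothetical helper: print the blocks of an HDFS file and the DataNodes
# that hold each replica. Requires a running cluster to produce a report.
show_block_locations() {
  hdfs_path="$1"
  if ! command -v hdfs >/dev/null 2>&1; then
    echo "hdfs command not found; is Hadoop installed and on the PATH?" >&2
    return 1
  fi
  # -files prints the file, -blocks its block list,
  # -locations the DataNode addresses for every block.
  hdfs fsck "$hdfs_path" -files -blocks -locations
}
```

For example, show_block_locations /some/file.txt would print one line per block of that file, where the path is a placeholder for any file stored on the cluster.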
Let’s see the way to read data from HDFS.
Using HDFS command:
With the help of the command below, we can read data directly from HDFS. (NOTE: Make sure all of your Hadoop daemons are running.)
Commands to start Hadoop Daemons
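On a typical installation, assuming Hadoop's sbin directory is on the PATH, the HDFS daemons are started with start-dfs.sh, and jps lists the running Java processes. The sketch below adds a small hypothetical helper, check_hdfs_daemons, that reads a jps listing and succeeds only if both a NameNode and a DataNode appear in it. The helper name and the messages are assumptions made for illustration.

```shell
# Hypothetical helper: reads a jps listing on stdin and succeeds only if
# both a NameNode and a DataNode process appear in it.
# grep -w matches whole words, so "SecondaryNameNode" does not count.
check_hdfs_daemons() {
  listing=$(cat)
  printf '%s\n' "$listing" | grep -qw 'NameNode' &&
    printf '%s\n' "$listing" | grep -qw 'DataNode'
}

# Start HDFS and verify, but only where Hadoop is actually installed.
if command -v start-dfs.sh >/dev/null 2>&1; then
  start-dfs.sh    # starts the HDFS daemons
  if jps | check_hdfs_daemons; then
    echo "HDFS daemons are running"
  else
    echo "NameNode or DataNode is missing from the jps listing" >&2
  fi
fi
```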
Syntax For Reading Data From HDFS:
hdfs dfs -get <source-path> <destination-path>
# <source-path>: path of the file on HDFS that we want to read
# <destination-path>: location on the local machine where the file will be stored
In our case, we have a file named dikshant.txt with some data in the HDFS root directory. We can use the command below to list the contents of the HDFS root directory.
hdfs dfs -ls /
The command below reads the file from the root directory of HDFS and stores it in /home/dikshant/Desktop on my local machine.
hdfs dfs -get /dikshant.txt /home/dikshant/Desktop
In the image below, we can observe that the data has been read successfully and stored in the /home/dikshant/Desktop directory. We can now view its content by opening the file.
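As an alternative to opening the file by hand, the local copy can be compared byte for byte with the HDFS original. The sketch below streams the file with hdfs dfs -cat and pipes it into cmp. The function name verify_copy and the HDFS_CAT override (which lets the helper be tried out without a cluster) are hypothetical conveniences, not part of Hadoop.

```shell
# Hypothetical helper: succeeds only if the HDFS file and the local copy
# have identical contents.
verify_copy() {
  remote_path="$1"
  local_path="$2"
  # HDFS_CAT defaults to the real command. The expansion is left unquoted
  # on purpose so that "hdfs dfs -cat" splits into separate words.
  ${HDFS_CAT:-hdfs dfs -cat} "$remote_path" | cmp -s - "$local_path"
}
```

With the file from this article, the check would be run as verify_copy /dikshant.txt /home/dikshant/Desktop/dikshant.txt, and a zero exit status means the two copies match.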