Map Reduce in Hadoop

Map Reduce is one of the three components of Hadoop. The first component, the Hadoop Distributed File System (HDFS), is responsible for storing files. The second component, Map Reduce, is responsible for processing them.

Suppose there is a text file named sample.txt. Hadoop is normally used to deal with huge files, but for the sake of easy explanation we take a small text file as an example. Let’s assume that sample.txt contains a few lines of text, as follows:

Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths

So, the above 8 lines are the content of the file. Let’s assume that while storing this file in Hadoop, HDFS broke it into four parts named first.txt, second.txt, third.txt, and fourth.txt. The file is thus divided into four equal parts of 2 lines each: the first two lines are in first.txt, the next two in second.txt, the next two in third.txt, and the last two in fourth.txt. All these parts are stored on Data Nodes, and the Name Node holds the metadata about them. All of this is the task of HDFS.
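The four-way split described above can be imitated with a small sketch. This is only a toy illustration: real HDFS splits a file into fixed-size blocks, not by line count, and the helper split_into_parts is a hypothetical name that is not part of Hadoop.

```python
# Toy illustration of the example above. Real HDFS splits files into
# fixed-size blocks, not by line count; split_into_parts is a
# hypothetical helper, not a Hadoop API.
SAMPLE_TEXT = """Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths"""

PART_NAMES = ["first.txt", "second.txt", "third.txt", "fourth.txt"]


def split_into_parts(text, names):
    """Divide the lines of text evenly among the given part names."""
    lines = text.splitlines()
    per_part = len(lines) // len(names)
    return {
        name: lines[i * per_part:(i + 1) * per_part]
        for i, name in enumerate(names)
    }


parts = split_into_parts(SAMPLE_TEXT, PART_NAMES)
for name, lines in parts.items():
    print(name, lines)
```

Each of the four parts ends up holding two consecutive lines of sample.txt, as in the description.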

Now, suppose a user wants to process this file. This is where Map Reduce comes into the picture. Suppose the user wants to run a query on sample.txt. Instead of bringing sample.txt to the local computer, we send the query to the data. To keep track of our request, we use the Job Tracker (a master service). The Job Tracker traps our request and keeps track of it.

Now suppose that the user wants to run the query on sample.txt and wants the output in result.output. Let the file containing the query be named query.jar. The user will then run a command like:



$ hadoop jar query.jar DriverCode sample.txt result.output
  1. query.jar: the jar file containing the query to be run on the input file.
  2. DriverCode: the driver (main) class inside the jar.
  3. sample.txt: input file.
  4. result.output: directory in which the output of the processing will be written.

The Job Tracker now traps this request and asks the Name Node about sample.txt. The Name Node then provides the metadata to the Job Tracker, which now knows that sample.txt is stored as first.txt, second.txt, third.txt, and fourth.txt. Since HDFS stores three copies of each of these four files, the Job Tracker communicates with a Task Tracker (a slave service) for each file, but only with the one holding the copy that resides nearest to it.

Note: Applying the desired code on the local first.txt, second.txt, third.txt and fourth.txt is a process. This process is called Map.

In Hadoop terminology, the main file sample.txt is called the input file and its four subfiles are called input splits. The number of mappers for an input file is equal to the number of its input splits. In the above case, sample.txt has four input splits, hence four mappers will run to process it. The Job Tracker is responsible for handling these mappers.

Note that the Task Trackers are slave services to the Job Tracker. If any of the local machines breaks down, the processing of that part of the file stops, which would halt the complete job. To guard against this, each Task Tracker sends a heartbeat, along with its number of slots, to the Job Tracker every 3 seconds. This is called the status of the Task Tracker. If a Task Tracker goes down, the Job Tracker waits for 10 heartbeat intervals, that is, 30 seconds; if it still receives no status, it assumes that the Task Tracker is either dead or extremely busy. It then communicates with the Task Tracker of another copy of the same file and directs it to run the desired code on that copy. Similarly, the Job Tracker uses the slot information to keep track of how many tasks the Task Tracker is currently serving and how many more can be assigned to it. In this way, the Job Tracker keeps track of our request.
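The heartbeat check described above can be sketched roughly as follows. The 3-second interval and the 10 missed heartbeats are simply the figures used in this article; real clusters set such values through configuration. The function names and tracker labels (tt1, tt2) are hypothetical and do not correspond to Hadoop classes.

```python
# Figures taken from the description above; real clusters configure these.
HEARTBEAT_INTERVAL_S = 3
MISSED_HEARTBEATS_ALLOWED = 10


def is_presumed_dead(last_heartbeat_s, now_s):
    """True once no heartbeat has arrived for more than 10 intervals (30 s)."""
    timeout_s = HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS_ALLOWED
    return now_s - last_heartbeat_s > timeout_s


def pick_task_tracker(replica_trackers, last_seen, now_s):
    """Return the first tracker (nearest copy first) that is still responsive.

    replica_trackers: hypothetical tracker names, ordered nearest first.
    last_seen: maps tracker name -> time (seconds) of its last heartbeat.
    """
    for tracker in replica_trackers:
        if not is_presumed_dead(last_seen[tracker], now_s):
            return tracker
    return None  # no copy of this split is currently reachable


last_seen = {"tt1": 0, "tt2": 90}
chosen = pick_task_tracker(["tt1", "tt2"], last_seen, now_s=100)
print(chosen)
```

Here tt1 has been silent for 100 seconds, so the task goes to tt2, which holds another copy of the same file.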

Now, suppose that the system has generated output for each of first.txt, second.txt, third.txt, and fourth.txt. This is not yet the user’s desired output. To produce the desired output, all these individual outputs have to be merged, or reduced, to a single output. This reduction of multiple outputs to a single one is also a process, carried out by the Reducer. In Hadoop, as many output files are generated as there are reducers. By default, there is one reducer per job.

Note: Map and Reduce are two different processes of the second component of Hadoop, that is, Map Reduce. They are also called the phases of Map Reduce: Phase 1 is Map and Phase 2 is Reduce.

Functioning of Map Reduce

Now, let us return to our sample.txt file with the same content. Again it is divided into four input splits, namely first.txt, second.txt, third.txt, and fourth.txt. Suppose we want to count the number of occurrences of each word in the file. Recall that the content of the file is:

Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths

Then the output of the ‘word count’ code will look like:

Hello - 1
I - 3 (I is written once in each of the first three lines)
am - 1
GeeksforGeeks - 1
How - 2 (How is written two times in the entire file)
Similarly
Are - 3
are - 2
... and so on
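The expected counts can be checked on a single machine with a short sketch. It uses Python's collections.Counter rather than Hadoop, so it only verifies the numbers and says nothing about how the distributed job runs. Words are split on whitespace and compared case-sensitively, which is why Are and are are counted separately.

```python
from collections import Counter

SAMPLE_TEXT = """Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths"""

# Split on whitespace; counting is case-sensitive.
word_counts = Counter(SAMPLE_TEXT.split())

for word in ["Hello", "I", "am", "GeeksforGeeks", "How", "Are", "are"]:
    print(f"{word} - {word_counts[word]}")
```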

Thus, in order to get this output, the user will have to send the query to the data. Suppose the ‘word count’ query is in the file wordcount.jar. The command will then look like:



$ hadoop jar wordcount.jar DriverCode sample.txt result.output

Types of File Format in Hadoop

Now, as we know, there are four input splits, so four mappers will be running, one on each input split. But mappers don’t run directly on the input splits, because the input splits contain text and mappers don’t understand text. Mappers understand (key, value) pairs only. Thus the text in the input splits first needs to be converted to (key, value) pairs. This is done by Record Readers. There are therefore as many record readers as there are input splits.

In Hadoop terminology, each line in a text file is termed a ‘record’. How the record reader converts this text into (key, value) pairs depends on the format of the file. These formats are predefined classes in Hadoop.

Four commonly used formats are:

  1. TextInputFormat
  2. KeyValueTextInputFormat
  3. SequenceFileInputFormat
  4. SequenceFileAsTextInputFormat

By default, a file is in TextInputFormat. The record reader reads one record (line) at a time and converts each record into a (key, value) pair according to the format of the file. For the time being, let’s assume that the first input split, first.txt, is in TextInputFormat. The record reader working on this input split then converts each record into the form (byte offset, entire line). For example, first.txt has the content:

Hello I am GeeksforGeeks
How can I help you

So, the output of the record reader has two pairs (since there are two records in the file). The first pair is (0, Hello I am GeeksforGeeks) and the second pair is (25, How can I help you). The second pair has a byte offset of 25 because the first line has 24 characters and the newline character (\n) is also counted as a character. Thus, the record reader produces as many (key, value) pairs as there are records. The mapper then runs once for each of these pairs. Similarly, the other mappers run on the (key, value) pairs of the other input splits. In this way, Hadoop breaks a big task into smaller tasks and executes them in parallel.
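The (byte offset, entire line) conversion can be sketched as follows. This is a simplified stand-in for what a record reader does under TextInputFormat, assuming single-byte characters and \n line endings; read_records is a hypothetical name, not a Hadoop API.

```python
def read_records(split_bytes):
    """Yield (byte offset, line) pairs, one per line of the split.

    The offset is the position of the line's first byte; the trailing
    newline counts toward the offset of the next line.
    """
    offset = 0
    for raw_line in split_bytes.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\n").decode("utf-8")
        offset += len(raw_line)


first_txt = b"Hello I am GeeksforGeeks\nHow can I help you\n"
records = list(read_records(first_txt))
for key, value in records:
    print((key, value))
```

Each pair produced this way is what one call of the mapper receives as its input.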

Shuffling and Sorting

Now, the mapper produces an output for each (key, value) pair provided by the record reader. Let us take the first input split, first.txt. The two pairs generated for this file by the record reader are (0, Hello I am GeeksforGeeks) and (25, How can I help you). The mapper takes one of these pairs at a time and produces (Hello, 1), (I, 1), (am, 1) and (GeeksforGeeks, 1) for the first pair and (How, 1), (can, 1), (I, 1), (help, 1) and (you, 1) for the second pair. Similarly, we have the outputs of all the other mappers. Note that this data contains duplicate keys, such as (I, 1) and (How, 1), which also need to be taken care of. This data is called Intermediate Data. Before the intermediate data is passed to the reducer, it goes through two more stages, called Shuffling and Sorting.

  1. Shuffling Phase: This phase combines all the values associated with an identical key. For example, (Are, 1) occurs three times across the mapper outputs for the input file. So after the shuffling phase, the output will be (Are, [1,1,1]).
  2. Sorting Phase: Once shuffling is done, the output is sent to the sorting phase, where all the (key, value) pairs are sorted by key automatically. In Hadoop, sorting is an automatic process because keys implement an inbuilt interface called WritableComparable.
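The map, shuffle and sort steps can be sketched on a single machine as follows. This only mimics the data flow and is not how Hadoop implements it; map_line and shuffle_and_sort are hypothetical names. The byte-offset key is left out because a word-count mapper does not use it.

```python
from collections import defaultdict


def map_line(line):
    """Word-count mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(intermediate_pairs):
    """Group values by key (shuffle), then order the groups by key (sort)."""
    grouped = defaultdict(list)
    for key, value in intermediate_pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


lines = [
    "Are you an engineer",
    "Are you looking for coding",
    "Are you looking for interview questions",
]

# Intermediate data: the combined output of all mapper calls.
intermediate = [pair for line in lines for pair in map_line(line)]
grouped = shuffle_and_sort(intermediate)
for key, values in grouped:
    print(key, values)
```

Python compares strings by code point, so capitalized keys such as Are sort before lowercase ones.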

After the shuffling and sorting phases are complete, the resultant output is sent to the reducer. If there are n distinct keys, each with its list of values, after shuffling and sorting, then the reducer runs n times, once per key, and produces the final processed output. In the above case, the output of the reducer is stored in the directory result.output, as specified in the command used to run the query on the data.
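A word-count reducer can be sketched as a function that is called once per key and sums that key's list of values. The name reduce_word and the hard-coded input are illustrative assumptions; the three groups shown are what shuffling and sorting would produce for those words over the whole of sample.txt.

```python
def reduce_word(word, counts):
    """Word-count reducer: called once per key, sums the key's values."""
    return word, sum(counts)


# A few of the (key, list of values) groups for sample.txt.
grouped = [
    ("Are", [1, 1, 1]),
    ("are", [1, 1]),
    ("you", [1, 1, 1, 1, 1, 1]),
]

final_counts = [reduce_word(word, counts) for word, counts in grouped]
for word, total in final_counts:
    print(f"{word} - {total}")
```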



