MapReduce is a programming model for processing data in parallel across the nodes of a Hadoop cluster, which is what makes Hadoop fast. When you are dealing with Big Data, serial processing is no longer practical. MapReduce mainly has two tasks, which are divided phase-wise:
- Map Task
- Reduce Task
Let us understand it with a real-world example that walks through the MapReduce programming model as a story:
- Suppose the Indian government has assigned you the task of counting the population of India. You can demand all the resources you want, but you have to finish in 4 months. Counting the population of such a large country is not an easy task for a single person (you). So what will your approach be?
- One way to solve this problem is to divide the country by states and assign an individual in-charge to each state to count the population of that state.
- Task of each individual: Each in-charge has to visit every home in the state and keep a record of the members of each house as:
    State_Name Member_House1
    State_Name Member_House2
    State_Name Member_House3
    ...
    State_Name Member_House n
    ...
For simplicity, we have taken only three states.
This is a simple divide-and-conquer approach, and each in-charge follows it to count the people in his or her state.
- Once they have counted the members of every house in their respective states, they need to sum up their results and send them to the headquarters in New Delhi.
- We have a trained officer at the headquarters to receive the results from every state and aggregate them by state to get the population of each entire state. With this approach, you can easily count the population of India by summing up the results obtained at the headquarters.
- The Indian Govt. is happy with your work, and the next year they ask you to do the same job in 2 months instead of 4. Again, you will be provided with all the resources you want.
- Since the Govt. has provided you with all the resources, you simply double the number of in-charges assigned to each state from one to two. To do that, divide each state into two divisions and assign a different in-charge to each division.
- As before, the in-charge of each division gathers information about the members of each house and keeps a record of it.
- We can do the same thing at the headquarters, so let's also divide it into two divisions, Head-quarter_Division1 and Head-quarter_Division2.
- Now, with this approach, you can find the population of India in two months. But there is a small problem: we never want the divisions of the same state to send their results to different headquarters divisions. If they did, Head-quarter_Division1 and Head-quarter_Division2 would each hold only a partial population for that state, which is inconsistent, because we want the consolidated population of each state, not partial counts.
- One easy way to solve this is to instruct all the in-charges of a given state to send their results to the same headquarters division, either Head-quarter_Division1 or Head-quarter_Division2, and to do likewise for every other state.
- Our problem has been solved, and you successfully did it in two months.
- Now, if they ask you to do this process in a month, you know how to approach the solution.
- Great, now we have a scalable model that works well. The model we have seen in this example is like the MapReduce programming model. By now it should be clear that MapReduce is a programming model, not a programming language.
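The routing rule from the story (every division of a state reports to the same headquarters division) can be sketched in Python. This is only an illustrative sketch, not Hadoop code: the sample records, the `hq_division_for` helper, and the use of `zlib.crc32` to pick a headquarters division are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

# Hypothetical survey records: (state, state_division, members_in_house).
RECORDS = [
    ("Rajasthan", 1, 2),
    ("Rajasthan", 2, 4),
    ("Kerala", 1, 3),
    ("Kerala", 2, 5),
    ("Punjab", 1, 1),
    ("Punjab", 2, 6),
]

NUM_HQ_DIVISIONS = 2  # Head-quarter_Division1 and Head-quarter_Division2


def hq_division_for(state: str) -> int:
    """Pick a headquarters division (0 or 1) from the state name alone.

    Because the choice depends only on the state, both divisions of a
    state always report to the same headquarters division.
    """
    return zlib.crc32(state.encode("utf-8")) % NUM_HQ_DIVISIONS


def count_population(records):
    """Return one {state: total} tally per headquarters division."""
    tallies = [defaultdict(int) for _ in range(NUM_HQ_DIVISIONS)]
    for state, _state_division, members in records:
        tallies[hq_division_for(state)][state] += members
    return tallies


hq_tallies = count_population(RECORDS)

# No state is split across headquarters divisions, so merging is a plain union.
state_totals = {
    state: total for tally in hq_tallies for state, total in tally.items()
}
national_total = sum(state_totals.values())

print(state_totals)
print(national_total)
```

The sketch uses `zlib.crc32` rather than Python's built-in `hash()` because string hashes from `hash()` can differ between runs, while the routing here needs to be the same every time for a given state.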
Now let’s discuss the phases and important things involved in our model.
1. Map Phase: The phase in which the individual in-charges collect the population of each house in their division is the Map Phase.
- Mapper: An individual in-charge who counts the population
- Input Splits: The state or the division of the state
- Key-Value Pair: The output of each Mapper, for example the key Rajasthan with the value 2
2. Reduce Phase: The phase in which the results are aggregated
- Reducers: The individuals who aggregate the actual result; in our example, the trained officers. Each Reducer produces its output as a key-value pair
3. Shuffle Phase: The phase in which the data is copied from the Mappers to the Reducers is the Shuffle Phase. It comes in between the Map and Reduce phases. The Map Phase, Shuffle Phase, and Reduce Phase are the three main phases of MapReduce.
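To make the three phases concrete, here is a minimal single-process Python sketch of the census example. It only mimics the data flow; it does not use Hadoop or run anything in parallel. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample input splits are hypothetical, chosen for illustration.

```python
from collections import defaultdict

# Hypothetical input splits: one list of (state, members_in_house) per division.
INPUT_SPLITS = [
    [("Rajasthan", 2), ("Rajasthan", 3)],
    [("Rajasthan", 4), ("Kerala", 1)],
    [("Kerala", 5)],
]


def map_phase(split):
    """Mapper: emit one (key, value) pair per house, e.g. ("Rajasthan", 2)."""
    for state, members in split:
        yield state, members


def shuffle_phase(mapped_pairs):
    """Shuffle: group every value under its key so each key meets one reducer."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Reducer: aggregate the values of each key into one (key, total) pair."""
    return {key: sum(values) for key, values in grouped.items()}


# Each split is mapped independently; in a real cluster these run in parallel.
mapped = [pair for split in INPUT_SPLITS for pair in map_phase(split)]
grouped = shuffle_phase(mapped)
population = reduce_phase(grouped)

print(grouped)     # {'Rajasthan': [2, 3, 4], 'Kerala': [1, 5]}
print(population)  # {'Rajasthan': 9, 'Kerala': 6}
```

Note that the mapper never needs to know about other splits, and the reducer never needs to know where a value came from; only the shuffle step moves data between the two.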