Map-Reduce is a programming model with two main phases: the Map phase and the Reduce phase. It is designed to process data in parallel, with the data divided across multiple machines (nodes). A Hadoop Java program consists of a Mapper class and a Reducer class along with a driver class. The Reducer is the second part of the Map-Reduce programming model. The Mapper produces its output as key-value pairs, which serve as the input to the Reducer.
Before these intermediate key-value pairs reach the Reducer, they are shuffled and sorted by key, so the key is the deciding factor for the sort order. The output generated by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computations such as addition, filtering, and aggregation. By default, a single Reducer processes the output of the Mappers; this number is configurable and can be changed by the user as required.
Let’s understand the Reducer in Map-Reduce:
In the above image, we can see multiple Mappers generating key-value pairs as output. The output of each Mapper is sent to the sorter, which sorts the key-value pairs by key. Shuffling takes place alongside the sorting; the result is then sent to the Reducer, which produces the final output.
Let’s take an example to understand how the Reducer works. Suppose the data for a college's faculty across all departments is stored in a CSV file. To find the total salary of the faculty in each department, we can use the department name as the key and the salary as the value. The Reducer then sums the values for each key and produces the desired output.
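The department example above can be sketched in plain Java. This is only an illustration under assumptions: the CSV layout (name, department, salary), the sample rows, and the class and method names are hypothetical, and the grouping done in run() merely stands in for what the framework does between the Mapper and the Reducer. No Hadoop classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the department salary example (no Hadoop classes involved).
class DeptSalarySum {

    // "Map" step: turn one CSV row (name,department,salary) into a (department, salary) pair.
    static Map.Entry<String, Long> map(String csvRow) {
        String[] fields = csvRow.split(",");
        return Map.entry(fields[1].trim(), Long.parseLong(fields[2].trim()));
    }

    // "Reduce" step: add up every salary that arrived for one department key.
    static long reduce(String department, Iterable<Long> salaries) {
        long total = 0;
        for (long salary : salaries) {
            total += salary;
        }
        return total;
    }

    // Stand-in for the framework: group the mapped pairs by key, then reduce each group.
    static Map<String, Long> run(List<String> csvRows) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String row : csvRows) {
            Map.Entry<String, Long> pair = map(row);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Long> totals = new TreeMap<>();
        grouped.forEach((dept, salaries) -> totals.put(dept, reduce(dept, salaries)));
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows: name, department, salary.
        List<String> rows = List.of(
            "Asha,CSE,50000",
            "Ravi,ECE,40000",
            "Meena,CSE,60000");
        System.out.println(run(rows));
    }
}
```

Here the two CSE rows end up in the same group, so the reduce step sees both salaries together and emits one total per department.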
Increasing the number of Reducers in a Map-Reduce job has the following effects:
- Framework overhead increases.
- The cost of failure is reduced.
- Load balancing improves.
One thing we also need to remember is that all the values for a given key are always processed by the same Reducer, although a single Reducer can handle many keys. Once the whole Reducer process is done, the output is stored in a part file (the default name) on HDFS (Hadoop Distributed File System). In the output directory on HDFS, Map-Reduce always creates a _SUCCESS file and a part-r-00000 file. The number of part files depends on the number of Reducers: with 5 Reducers, the part files run from part-r-00000 to part-r-00004. By default, these files are named in the form part-a-bbbbb, where a is r for Reducer output and bbbbb is the five-digit task number. The name can be changed manually; all we need to do is set the property below in the driver code of our Map-Reduce program.
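// Here we are changing the output file base name from "part" to "GeeksForGeeks" (e.g. part-r-00000 becomes GeeksForGeeks-r-00000)
job.getConfiguration().set("mapreduce.output.basename", "GeeksForGeeks");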
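To make the naming pattern concrete, here is a small plain-Java sketch that only builds the expected file names for a given number of Reducers. The class and method names are hypothetical and the code does not touch HDFS or any Hadoop API; it simply formats the base name, the letter r, and a five-digit index as described above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: list the output file names we would expect for a given number of Reducers.
class PartFileNames {

    // Builds names of the form <baseName>-r-<5-digit reducer index>.
    static List<String> names(String baseName, int numReducers) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            result.add(String.format("%s-r-%05d", baseName, i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Default base name with 5 Reducers: part-r-00000 ... part-r-00004
        System.out.println(names("part", 5));
    }
}
```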
The Reducer in Map-Reduce consists mainly of 3 processes/phases:
- Shuffle: Shuffling carries the data from the Mappers to the Reducer that needs it. Using HTTP, the framework fetches the relevant partition of the output of all the Mappers.
- Sort: In this phase, the output of the Mappers, that is, the key-value pairs, is sorted by key.
- Reduce: Once shuffling and sorting are done, the Reducer combines the values it received and performs the required computation. The OutputCollector.collect() method is used to write the output to HDFS (in the newer org.apache.hadoop.mapreduce API, Context.write() plays this role). Keep in mind that the output of the Reducer is not sorted.
Note: Shuffling and Sorting both execute in parallel.
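The three phases can be imitated in plain Java to see how they fit together. This is a simplified sketch, not how Hadoop implements them: it assumes keys are routed to Reducers by hashing (the text above does not say how the partition is chosen), it uses a TreeMap to stand in for the sort, and it sums integer values as the example computation. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of shuffle, sort and reduce using in-memory collections only.
class ShuffleSortSketch {

    // Assumption: the target Reducer is chosen by hashing the key.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Shuffle: route every pair to its Reducer's partition.
    // Sort: each partition is a TreeMap, so its keys stay in sorted order.
    static List<TreeMap<String, List<Integer>>> shuffleAndSort(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapOutput) {
            int target = partitionFor(pair.getKey(), numReducers);
            partitions.get(target)
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
        }
        return partitions;
    }

    // Reduce: one computation per key, here a sum of that key's values.
    static Map<String, Integer> reduce(TreeMap<String, List<Integer>> partition) {
        Map<String, Integer> output = new TreeMap<>();
        partition.forEach((key, values) -> {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            output.put(key, sum);
        });
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("b", 1), Map.entry("a", 2), Map.entry("b", 3));
        for (TreeMap<String, List<Integer>> partition : shuffleAndSort(mapOutput, 2)) {
            System.out.println(reduce(partition));
        }
    }
}
```

Because the partition depends only on the key, both "b" pairs reach the same Reducer, which is what lets it compute a complete result for that key.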
Setting Number Of Reducers In Map-Reduce:
- With Command Line: When running our Map-Reduce program, we can manually set the number of Reducers with the mapred.reduce.tasks property (named mapreduce.job.reduces in newer Hadoop releases).
- With JobConf instance: In our driver class, we can specify the number of Reducers by calling job.setNumReduceTasks(int) on the job instance.
For example, job.setNumReduceTasks(2) gives us 2 Reducers. We can also set the number of Reducers to 0 if we need a Map-only job.
// Ideally, the number of Reducers in a Map-Reduce job should be set to:
0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>)
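As a quick worked example of this rule of thumb, the sketch below plugs numbers into the formula. The cluster size (10 nodes, 8 containers per node), the class and method names, and the choice to round to the nearest whole number are all assumptions made for illustration; the formula itself does not say how to round.

```java
// Sketch: apply the 0.95 / 1.75 rule of thumb to a hypothetical cluster.
class ReducerCountHint {

    // factor is 0.95 or 1.75; the result is rounded to the nearest integer (an assumption).
    static int suggestedReducers(double factor, int nodes, int maxContainersPerNode) {
        return (int) Math.round(factor * nodes * maxContainersPerNode);
    }

    public static void main(String[] args) {
        int nodes = 10;               // hypothetical number of nodes
        int maxContainersPerNode = 8; // hypothetical containers per node
        System.out.println(suggestedReducers(0.95, nodes, maxContainersPerNode)); // 0.95 * 80
        System.out.println(suggestedReducers(1.75, nodes, maxContainersPerNode)); // 1.75 * 80
    }
}
```

The value returned here is what we would then pass to job.setNumReduceTasks(int) in the driver.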