Hadoop MapReduce – Data Flow
Map-Reduce is a processing framework for processing data across a large number of machines, and Hadoop uses it to process the data distributed in a Hadoop cluster. Map-Reduce differs from conventional frameworks and platforms such as Hibernate, the JDK, or .NET. Those are designed for traditional systems where the data is stored in a single location, such as a Network File System or an Oracle database. With big data, however, HDFS spreads the data across multiple commodity machines.
When the data is stored on multiple nodes, we need a processing framework that can copy the program to every machine where the data is present. This is where Map-Reduce comes into the picture for processing data on Hadoop over a distributed system. Because of the massive volume of data, moving it between machines causes heavy cross-switch network traffic, which is a major drawback in a Hadoop cluster. Map-Reduce addresses this with a feature called Data-Locality: the ability to move the computation close to where the data actually resides instead of moving the data to the computation.
Hadoop is designed to work on commodity hardware, and it uses Map-Reduce because it is widely accepted and provides an easy way to process data over multiple nodes. Map-Reduce is not the only framework for parallel processing. Spark is now also a popular framework for distributed computing, and Hama and MPI are further examples of distributed processing frameworks.
Let’s Understand Data-Flow in Map-Reduce
Map-Reduce consists of two phases: the Map phase and the Reduce phase. The Map phase is used for transformation, while the Reduce phase is used for aggregation-style operations. The names Map and Reduce come from the map and reduce operations of functional programming languages such as Lisp; languages like Scala offer the same operations. A Map-Reduce program has 3 main components: the Driver code, the Mapper (for transformation), and the Reducer (for aggregation).
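The functional origin of the two names can be illustrated with Python's built-in `map` and `functools.reduce`. This is only a single-machine analogy, not Hadoop code, and the sample list is made up for illustration.

```python
from functools import reduce

# Hypothetical sample data: sizes of four records in bytes.
record_sizes = [12, 7, 30, 5]

# Map: transform every element independently (here, bytes -> bits).
bits = list(map(lambda n: n * 8, record_sizes))

# Reduce: aggregate all transformed elements into a single value.
total_bits = reduce(lambda acc, n: acc + n, bits, 0)

print(bits)        # [96, 56, 240, 40]
print(total_bits)  # 432
```

Because each element is transformed independently in the map step, that step can be spread over many machines, which is the property Map-Reduce builds on.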
Let’s take an example where you have a 10TB file to process on Hadoop. HDFS first distributes the 10TB of data across multiple nodes. To process this data with Map-Reduce, we need Driver code, which defines and submits the Job. If we are using Java to process the data on HDFS, the Driver class must create and configure a Job object. Think of the framework as a car: the start button that starts the car plays the same role as the Driver code in Map-Reduce. We need to run the Driver code to make use of the Map-Reduce framework.
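To get a feel for the scale of this example, the number of HDFS blocks for the 10TB file can be estimated with simple arithmetic. The sketch below assumes the default block size of 128 MB (Hadoop 2 and later) and binary units; the real value depends on the cluster's `dfs.blocksize` setting, and the helper name `block_count` is hypothetical.

```python
TB = 1024 ** 4
MB = 1024 ** 2

def block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of blocks needed to store a file (the last block may be partial)."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

blocks = block_count(10 * TB)
print(blocks)  # 81920
```

Under these assumptions the file is stored as 81,920 blocks, each of which can be processed where it is stored.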
The framework also provides predefined Mapper and Reducer classes, which developers extend and modify according to the organization's requirements.
Brief Working of Mapper
The Mapper is the first piece of code that interacts with the input dataset. Suppose the dataset we are analyzing consists of 100 data blocks. In that case, there will be 100 Mapper processes (by default one per input split, which normally corresponds to one block) running in parallel on the machines (nodes). Each produces its own output, known as intermediate output, which is stored on the local disk, not on HDFS. The output of the Mappers acts as input for the Reducer, where it is sorted and aggregated to produce the final output.
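A Mapper's job can be sketched in plain Python with the classic word-count example. The function name `map_words` and the sample split are hypothetical; a real Hadoop Mapper would typically be a Java class, and this sketch only imitates the shape of its input and output.

```python
def map_words(offset, line):
    """Imitates a word-count Mapper: one input record in, many (key, value) pairs out."""
    for word in line.lower().split():
        yield (word, 1)

# One hypothetical input split given as (byte offset, line) records.
split = [(0, "Deer Bear River"), (16, "Car Car River")]

# The Mapper is called once per record; all emitted pairs form the intermediate output.
intermediate = [pair for offset, line in split for pair in map_words(offset, line)]
print(intermediate)
# [('deer', 1), ('bear', 1), ('river', 1), ('car', 1), ('car', 1), ('river', 1)]
```

Note that the Mapper does no counting itself. It only emits one pair per word and leaves the aggregation to the Reducer.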
Brief Working of Reducer
The Reducer is the second part of the Map-Reduce programming model. The Mapper produces output in the form of key-value pairs, which serve as input for the Reducer. Before these intermediate key-value pairs reach the Reducer, a shuffle and sort step groups and orders them by key. The output generated by the Reducer is the final output, which is stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computations such as addition, filtration, and aggregation.
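The shuffle and sort step and the Reducer can be imitated in a few lines of Python. This is a minimal single-process sketch, not how Hadoop implements it; the names `shuffle_and_sort` and `reduce_counts` and the sample pairs are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(pairs):
    """Sort intermediate pairs by key and collect the values of each key into a list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

def reduce_counts(key, values):
    """Imitates a word-count Reducer: aggregate all values that belong to one key."""
    return (key, sum(values))

# Hypothetical intermediate output collected from several Mappers.
pairs = [("river", 1), ("car", 1), ("bear", 1), ("car", 1), ("river", 1)]

grouped = shuffle_and_sort(pairs)
result = [reduce_counts(key, values) for key, values in grouped]
print(grouped)  # [('bear', [1]), ('car', [1, 1]), ('river', [1, 1])]
print(result)   # [('bear', 1), ('car', 2), ('river', 2)]
```

The Reducer is called once per key and sees all values for that key together, which is why the grouping has to happen before it runs.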
Steps of Data-Flow:
- Each Mapper processes a single input split at a time. The developer overrides the Mapper with the business logic, and the Mappers run in parallel on the machines in the cluster.
- The intermediate output generated by each Mapper is stored on the local disk and then shuffled to the Reducer for the reduce task.
- Once the Mappers finish their tasks, their output is sorted, merged, and provided to the Reducer.
- The Reducer performs reducing tasks such as aggregation and other computational operations, and the final output is stored on HDFS in a file named part-r-00000 (the default name).
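The steps above can be tied together in a small local simulation. It is a sketch under loud assumptions: threads stand in for cluster nodes, a temporary directory stands in for HDFS, and the function names and sample splits are hypothetical. Only the output file name part-r-00000 is taken from the text.

```python
import os
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_mapper(split):
    """Process one input split and emit (word, 1) pairs."""
    return [(word, 1) for line in split for word in line.split()]

def shuffle(mapper_outputs):
    """Merge all Mapper outputs and group the values by key, sorted by key."""
    groups = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            groups[key].append(value)
    return sorted(groups.items())

def run_job(splits, output_dir):
    # Mappers run in parallel, one per split (threads stand in for nodes).
    with ThreadPoolExecutor() as pool:
        mapper_outputs = list(pool.map(run_mapper, splits))
    # Reduce each key and write tab-separated results to the output file.
    path = os.path.join(output_dir, "part-r-00000")
    with open(path, "w") as out:
        for key, values in shuffle(mapper_outputs):
            out.write(f"{key}\t{sum(values)}\n")
    return path

# Three hypothetical input splits, each a list of lines.
splits = [["deer bear river"], ["car car river"], ["deer car bear"]]
with tempfile.TemporaryDirectory() as tmp:
    with open(run_job(splits, tmp)) as f:
        output = f.read()
print(output)
```

With these sample splits the file contains one line per word in key order: bear 2, car 3, deer 2, river 2.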