Hadoop’s MapReduce framework provides a facility to cache small to moderate-sized read-only files (text files, zip files, jar files, etc.) and broadcast them to all the DataNodes (worker nodes) where a MapReduce job is running. Each DataNode gets a local copy of the file sent through the Distributed Cache. When the job finishes, these files are deleted from the DataNodes.
Why cache a file?
Some files are needed by every task of a MapReduce job. Rather than reading such a file from HDFS again and again — say, 100 times if 100 mappers are running — which adds seek time and therefore latency, we send a copy of the file to every DataNode just once.
Let’s see an example where we count the words in lyrics.txt, excluding the words present in stopWords.txt. You can find these files here.
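The core logic of this example — counting only the words that are not in the stop list — can be sketched in plain Java. This is an illustrative sketch, not the actual job code: in the real Mapper, the stop-word set would be filled in `setup()` by reading the localized copy of stopWords.txt, and the counts would be emitted as key–value pairs instead of returned in a map.

```java
import java.util.*;

public class StopWordFilter {
    // Build the stop-word set. In a real Mapper this would be done once in
    // setup(), reading the local copy of stopWords.txt that the Distributed
    // Cache placed on the DataNode.
    static Set<String> loadStopWords(List<String> lines) {
        Set<String> stop = new HashSet<>();
        for (String line : lines) {
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) stop.add(w.toLowerCase());
            }
        }
        return stop;
    }

    // Count every word in the input text that is not a stop word.
    // In the real job, map() would emit (word, 1) pairs and the reducer
    // would do the summing.
    static Map<String, Integer> countWords(String text, Set<String> stop) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            if (!w.isEmpty() && !stop.contains(w)) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> stop = loadStopWords(Arrays.asList("the to a an"));
        Map<String, Integer> counts = countWords("to be or not to be", stop);
        System.out.println(counts.get("be"));          // 2
        System.out.println(counts.containsKey("to"));  // false
    }
}
```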
1. Copy both the files from the local filesystem to HDFS.
bin/hdfs dfs -put ../Desktop/lyrics.txt /geeksInput
// this file will be cached
bin/hdfs dfs -put ../Desktop/stopWords.txt /cached_Geeks
2. Get the NameNode server address. Since the file has to be accessed via a URI (Uniform Resource Identifier), we need this address. It can be found in core-site.xml, as the value of the fs.defaultFS property.
On my PC it’s hdfs://localhost:9000; it may vary on yours.
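With the NameNode address known, the Driver class registers the stop-words file with the Distributed Cache. A minimal sketch of building the cache URI is below — it assumes the address and paths from the steps above (i.e. that stopWords.txt landed at /cached_Geeks/stopWords.txt); the actual Hadoop calls are shown only in comments since they need the Hadoop jars on the classpath:

```java
import java.net.URI;

public class CacheUriExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS from core-site.xml + the HDFS path of the file to cache
        URI cacheUri = new URI("hdfs://localhost:9000/cached_Geeks/stopWords.txt");
        System.out.println(cacheUri.getHost()); // localhost
        System.out.println(cacheUri.getPath()); // /cached_Geeks/stopWords.txt

        // In the real Driver (Hadoop 2.x mapreduce API), this URI is handed
        // to the job before it is submitted, e.g.:
        //   Job job = Job.getInstance(conf, "word count with distributed cache");
        //   job.addCacheFile(cacheUri);
    }
}
```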
How to Execute the Code?
- Export the project as a jar file and copy it to your Ubuntu desktop as distributedExample.jar
- Start your Hadoop services. Go inside hadoop_home_dir and in a terminal type
sbin/start-dfs.sh
sbin/start-yarn.sh
- Run the jar file
bin/yarn jar jar_file_path packageName.Driver_Class_Name inputFilePath outputFilePath
bin/yarn jar ../Desktop/distributedExample.jar word_count_DC.Driver /geeksInput /geeksOutput
// will print the words starting with t
bin/hdfs dfs -cat /geeksOutput/part* | grep ^t
In the output, we can observe that there is no "the" or "to" — the words we wanted to ignore.