Hadoop’s MapReduce framework can cache small to moderately sized read-only files, such as text files, zip files, and jar files, and broadcast them to all the DataNodes (worker nodes) where the MapReduce job is running. Each DataNode gets a local copy of the file, delivered through the Distributed Cache. When the job finishes, these files are deleted from the DataNodes.
Why cache a file?
Some files are needed by every task of a MapReduce job. Rather than reading such a file from HDFS each time (say 100 times if 100 Mappers are running), which adds seek time and therefore latency, we send a copy of the file to each DataNode once.
Let’s see an example where we count the words in lyrics.txt, excluding the words present in stopWords.txt. You can find these files here.
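Before touching Hadoop, the filtering idea can be sketched in plain Java. This is only an illustration, not the job's actual Mapper: the class name StopWordFilterSketch, the sample sentence, and the choices to lowercase the text and split on non-word characters are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java sketch of the filtering logic only; no Hadoop classes are used.
class StopWordFilterSketch {

    // Counts the words in `text`, skipping every word found in `stopWords`.
    // Lowercasing and splitting on non-word characters are assumptions.
    static Map<String, Integer> countWords(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue; // ignore empty tokens and stop words
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for stopWords.txt and lyrics.txt.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "to"));
        String lyrics = "Take me to the river, drop me in the water";
        System.out.println(countWords(lyrics, stopWords));
    }
}
```

In the MapReduce job, the set of stop words would be filled from the cached copy of stopWords.txt rather than hard-coded, and the per-word counts would be summed by the Reducer.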
1. Copy both files from the local filesystem to HDFS.
bin/hdfs dfs -put ../Desktop/lyrics.txt /geeksInput
# this file will be cached
bin/hdfs dfs -put ../Desktop/stopWords.txt /cached_Geeks
2. Get the NameNode server address. Since the file has to be accessed via a URI (Uniform Resource Identifier), we need this address. It can be found in core-site.xml, typically as the value of the fs.defaultFS property.
On my PC it is hdfs://localhost:9000; it may differ on yours.
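As a rough sketch of how the two pieces fit together, the NameNode address and the HDFS path of the cached file can be joined into a single URI using only the standard library. The class and method names here are hypothetical, and the path /cached_Geeks/stopWords.txt assumes that /cached_Geeks already existed as a directory when the file was uploaded in step 1.

```java
import java.net.URI;

// Sketch only: builds the URI that identifies the file to be cached.
class CacheUriSketch {

    // Joins the NameNode address with an absolute HDFS path.
    static URI cacheUri(String nameNode, String hdfsPath) {
        return URI.create(nameNode + hdfsPath);
    }

    public static void main(String[] args) {
        // Address from core-site.xml; the file path is an assumption (see above).
        URI uri = cacheUri("hdfs://localhost:9000", "/cached_Geeks/stopWords.txt");
        System.out.println(uri.getScheme()); // scheme part of the URI
        System.out.println(uri.getHost());   // NameNode host
        System.out.println(uri.getPort());   // NameNode port
        System.out.println(uri.getPath());   // path of the file inside HDFS
        // In a Hadoop Driver, a URI like this is what would be passed to
        // Job.addCacheFile(uri) to register the file with the Distributed Cache.
    }
}
```

If your NameNode address differs, only the first argument changes; the HDFS path stays the same.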
How to Execute the Code?
- Export the project as a jar file and copy it to your Ubuntu desktop as distributedExample.jar
- Start your Hadoop services. Go inside hadoop_home_dir and in terminal type
- Run the jar file
bin/yarn jar jar_file_path packageName.Driver_Class_Name inputFilePath outputFilePath
bin/yarn jar ../Desktop/distributedExample.jar word_count_DC.Driver /geeksInput /geeksOutput
# will print the words starting with t
bin/hdfs dfs -cat /geeksOutput/part* | grep ^t
In the output, we can observe that the words "the" and "to", which we wanted to ignore, do not appear.