Distributed Cache in Hadoop MapReduce

Last Updated : 19 May, 2019

Hadoop’s MapReduce framework can cache small to moderately sized read-only files (text files, zip archives, jar files, and so on) and distribute them to all the worker nodes (DataNodes) on which the MapReduce job runs. Each node receives a local copy of the file through the Distributed Cache. When the job finishes, these files are deleted from the nodes.

Why cache a file?

Some files are needed by every task of a MapReduce job. Rather than reading such a file from HDFS each time it is needed (say 100 times if 100 mappers are running), which adds seek time and therefore latency, we send a copy of the file to each DataNode once.

Let’s look at an example that counts the words in lyrics.txt, excluding the words listed in stopWords.txt. You can find these files here.

Prerequisites:

1. Copy both the files from the local filesystem to HDFS.

bin/hdfs dfs -put ../Desktop/lyrics.txt  /geeksInput

// this file will be cached
bin/hdfs dfs -put ../Desktop/stopWords.txt /cached_Geeks

2. Get the NameNode address. The cached file is accessed through a URI (Uniform Resource Identifier), so we need this address. It is the value of the fs.defaultFS property in core-site.xml:

Hadoop_Home_dir/etc/hadoop/core-site.xml

On my machine it is hdfs://localhost:9000; it may differ on yours.
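As a rough illustration of how the NameNode address and the HDFS path combine into the URI that the driver later registers, here is a minimal sketch using only java.net.URI. The class and method names (CacheUriSketch, cacheUri) are made up for this example and are not part of Hadoop.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not a Hadoop API: joins a NameNode address
// with an absolute HDFS path to form the URI of the file to cache.
class CacheUriSketch {

    static URI cacheUri(String nameNode, String hdfsPath) {
        URI base = URI.create(nameNode);
        try {
            // Reuse scheme, host and port of the NameNode; only the path is new.
            return new URI(base.getScheme(), null, base.getHost(),
                           base.getPort(), hdfsPath, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Bad HDFS path: " + hdfsPath, e);
        }
    }

    public static void main(String[] args) {
        // Values taken from this article; yours may differ.
        URI uri = cacheUri("hdfs://localhost:9000", "/cached_Geeks/stopWords.txt");
        System.out.println(uri); // hdfs://localhost:9000/cached_Geeks/stopWords.txt
    }
}
```

Building the URI from its parts keeps the NameNode address in one place, which helps if the address differs between machines.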

Mapper Code:

package word_count_DC; 
  
import java.io.*; 
import java.util.*; 
import java.net.URI; 
  
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.mapreduce.Mapper; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
  
public class Cached_Word_Count extends Mapper<LongWritable,  
                                Text, Text, LongWritable> { 
  
    ArrayList<String> stopWords = null; 
  
    public void setup(Context context) throws IOException,  
                                     InterruptedException 
    { 
        stopWords = new ArrayList<String>(); 
  
        URI[] cacheFiles = context.getCacheFiles(); 
  
        if (cacheFiles != null && cacheFiles.length > 0)  
        { 
            String line = ""; 
  
            // Create a FileSystem object and pass the  
            // configuration object in it. The FileSystem 
            // is an abstract base class for a fairly generic 
            // filesystem. All user code that may potentially  
            // use the Hadoop Distributed File System should 
            // be written to use a FileSystem object. 
            FileSystem fs = FileSystem.get(context.getConfiguration()); 
            Path getFilePath = new Path(cacheFiles[0].toString()); 
  
            // We open the file using FileSystem object,  
            // convert the input byte stream to character 
            // streams using InputStreamReader and wrap it  
            // in BufferedReader to make it more efficient. 
            // try-with-resources closes the reader even if 
            // reading fails; an IOException propagates and 
            // fails the task instead of killing the JVM. 
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(getFilePath)))) { 
  
                while ((line = reader.readLine()) != null)  
                { 
                    String[] words = line.split(" "); 
  
                    for (int i = 0; i < words.length; i++)  
                    { 
                        // add the words to ArrayList 
                        stopWords.add(words[i]);  
                    } 
                } 
            } 
        } 
    } 
  
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException 
    { 
        String words[] = value.toString().split(" "); 
  
        for (int i = 0; i < words.length; i++)  
        { 
  
            // removing all special symbols  
            // and converting it to lowerCase 
            String temp = words[i].replaceAll("[?, '()]", "").toLowerCase(); 
  
            // if non-empty and not present in ArrayList we write 
            if (!temp.isEmpty() && !stopWords.contains(temp))  
            { 
                context.write(new Text(temp), new LongWritable(1)); 
            } 
        } 
    } 
} 
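The mapper calls stopWords.contains() once per token, and on an ArrayList that is a linear scan. The following standalone sketch (plain Java, no Hadoop classes) shows the same load-and-filter logic with a HashSet, where lookups take constant time on average. The names StopWordFilterSketch, loadStopWords and keep are hypothetical, and skipping empty tokens is an addition made in this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical standalone version of the mapper's stop-word logic.
class StopWordFilterSketch {

    // Collects space-separated stop words from each line into a set.
    static Set<String> loadStopWords(Iterable<String> lines) {
        Set<String> stopWords = new HashSet<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    stopWords.add(word);
                }
            }
        }
        return stopWords;
    }

    // Applies the mapper's normalization, then drops stop words and empty tokens.
    static List<String> keep(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String raw : line.split(" ")) {
            String token = raw.replaceAll("[?, '()]", "").toLowerCase();
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = loadStopWords(Arrays.asList("the to", "a"));
        System.out.println(keep("The road to (nowhere)?", stop)); // [road, nowhere]
    }
}
```

In the real mapper, each kept token would be passed to context.write together with a count of 1.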

Reducer Code:

package word_count_DC; 
  
import java.io.*; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.mapreduce.Reducer; 
  
public class Cached_Reducer extends Reducer<Text,  
              LongWritable, Text, LongWritable> { 
  
    public void reduce(Text key, Iterable<LongWritable> values, 
        Context context) throws IOException, InterruptedException 
    { 
        long sum = 0; 
  
        for (LongWritable val : values) 
        { 
            sum += val.get(); 
        } 
  
        context.write(key, new LongWritable(sum)); 
    } 
} 
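To see what the reducer receives and returns without starting a cluster, the grouping done by the framework and the summation above can be imitated in plain Java. This is only a local stand-in, assuming every mapper output is a (word, 1) pair; ReduceSketch and its methods are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local imitation of shuffle + reduce for word counting.
class ReduceSketch {

    // Mirrors the reducer body: add up every count emitted for one key.
    static long reduce(Iterable<Long> values) {
        long sum = 0;
        for (long val : values) {
            sum += val;
        }
        return sum;
    }

    // Groups the mapper's (word, 1) pairs by word, then reduces each group.
    static Map<String, Long> shuffleAndReduce(List<String> mapperKeys) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String key : mapperKeys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(1L);
        }
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : grouped.entrySet()) {
            totals.put(entry.getKey(), reduce(entry.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(shuffleAndReduce(Arrays.asList("road", "nowhere", "road")));
        // {nowhere=1, road=2}
    }
}
```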

Driver Code:

package word_count_DC; 
  
import java.io.*; 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.mapreduce.Job; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 
import org.apache.hadoop.util.GenericOptionsParser; 
  
public class Driver { 
  
    public static void main(String[] args) throws IOException,  
                 InterruptedException, ClassNotFoundException 
    { 
  
        Configuration conf = new Configuration(); 
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); 
  
        if (otherArgs.length != 2)  
        { 
            System.err.println("Error: Give only two paths for <input> <output>"); 
            System.exit(1); 
        } 
  
        Job job = Job.getInstance(conf, "Distributed Cache"); 
  
        job.setJarByClass(Driver.class); 
        job.setMapperClass(Cached_Word_Count.class); 
        job.setReducerClass(Cached_Reducer.class); 
  
        job.setMapOutputKeyClass(Text.class); 
        job.setMapOutputValueClass(LongWritable.class); 
  
        job.setOutputKeyClass(Text.class); 
        job.setOutputValueClass(LongWritable.class); 
  
        try { 
  
            // the complete URI(Uniform Resource  
            // Identifier) file path in Hdfs 
            job.addCacheFile(new URI("hdfs://localhost:9000/cached_Geeks/stopWords.txt")); 
        } 
        catch (Exception e) { 
            System.out.println("File Not Added"); 
            System.exit(1); 
        } 
  
        FileInputFormat.addInputPath(job, new Path(otherArgs[0])); 
  
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); 
  
        // throws ClassNotFoundException, so handle it 
        System.exit(job.waitForCompletion(true) ? 0 : 1);  
    } 
} 
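The driver registers the cached file by its full HDFS URI. As far as I know, Hadoop on YARN then exposes the localized copy in each task's working directory under a link name: the URI fragment (the part after #) if one is given, otherwise the file name. The sketch below only derives that name with java.net.URI; CacheLinkNameSketch and linkName are hypothetical names, and the naming rule itself should be treated as an assumption to check against your Hadoop version.

```java
import java.net.URI;

// Hypothetical helper, not a Hadoop API: derives the name under which
// a cached file is assumed to appear in the task's working directory.
class CacheLinkNameSketch {

    static String linkName(URI cacheFile) {
        // An explicit fragment such as "#stop" takes precedence.
        if (cacheFile.getFragment() != null) {
            return cacheFile.getFragment();
        }
        // Otherwise fall back to the last segment of the path.
        String path = cacheFile.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(linkName(
            URI.create("hdfs://localhost:9000/cached_Geeks/stopWords.txt")));      // stopWords.txt
        System.out.println(linkName(
            URI.create("hdfs://localhost:9000/cached_Geeks/stopWords.txt#stop"))); // stop
    }
}
```

Under that assumption, a mapper could open the localized copy by this name with an ordinary java.io reader instead of going back to HDFS.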

How to Execute the Code?

1. Export the project as a jar file and copy it to your Ubuntu desktop as distributedExample.jar.

2. Start your Hadoop services. Go to hadoop_home_dir and type the following in a terminal:
```
sbin/start-all.sh
```
3. Run the jar file:

bin/yarn jar jar_file_path packageName.Driver_Class_Name inputFilePath outputFilePath

bin/yarn jar ../Desktop/distributedExample.jar word_count_DC.Driver /geeksInput /geeksOutput

Output:
```
# prints only the words starting with t

bin/hdfs dfs -cat /geeksOutput/part* | grep ^t
```
In the output, neither "the" nor "to" appears; these are the stop words we wanted to ignore.
