Hadoop Streaming is a utility that ships with Hadoop and lets developers write MapReduce programs in languages other than Java, such as Python, C++, or Ruby. It works with any language that can read from standard input and write to standard output. In this article we use Python with Hadoop Streaming to solve the word count problem, creating mapper.py and reducer.py to perform the map and reduce tasks.
Let’s create one file which contains multiple words that we can count.
Step 1: Create a file with the name word_count_data.txt and add some data to it.
cd Documents/                 # change to the Documents directory
touch word_count_data.txt     # create an empty file
nano word_count_data.txt      # edit the file with the nano command-line editor
cat word_count_data.txt       # print the content of the file
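If you prefer to generate the input file from a script, the following is a minimal sketch. The sample sentences and the helper name write_sample_data are hypothetical and only serve as illustration; any plain-text content works as input.

```python
from pathlib import Path

# Hypothetical sample text; replace it with any plain-text content.
SAMPLE_LINES = [
    "hadoop streaming with python",
    "python mapper and python reducer",
    "hadoop mapreduce word count",
]


def write_sample_data(path):
    """Write the sample lines to `path`, one per line, and return the path."""
    target = Path(path)
    target.write_text("\n".join(SAMPLE_LINES) + "\n", encoding="utf-8")
    return target
```

Calling write_sample_data("word_count_data.txt") from the Documents directory would produce a small file with several repeated words, which makes the later counts easy to check by eye.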
Step 2: Create a mapper.py file that implements the mapper logic. It reads data from STDIN, splits each line into words, and writes each word to STDOUT together with a count of 1.
cd Documents/       # change to the Documents directory
touch mapper.py     # create an empty file
cat mapper.py       # print the content of the file
Copy the below code to the mapper.py file.
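A minimal sketch of such a mapper is given here. It assumes words are separated by whitespace and that each output record has the form word, tab, 1. The shebang line pointing at python3 and the function name map_lines are assumptions; this is one possible implementation, not necessarily the exact listing the author used.

```python
#!/usr/bin/env python3
"""mapper.py: emit one 'word<TAB>1' line for every word read from standard input."""
import sys


def map_lines(lines):
    """Yield a tab-separated (word, 1) record for each whitespace-separated word."""
    for line in lines:
        for word in line.split():  # split() also drops the trailing newline
            yield word + "\t1"


def main():
    # Hadoop Streaming feeds the input split to the mapper on standard input.
    for record in map_lines(sys.stdin):
        sys.stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

Keeping the logic in map_lines, separate from the standard input handling, makes it easy to try the mapper on a plain Python list before running it under Hadoop.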
In the mapper.py script, the first line begins with #!, which is known as a shebang. It tells the operating system which interpreter to use when the file is executed directly.
Let’s test mapper.py locally to check that it works.
cat <text_data_file> | python <mapper_code_python_file>
Command (in my case):
cat word_count_data.txt | python mapper.py
The mapper prints each word on its own line together with its count.
Step 3: Create a reducer.py file that implements the reducer logic. It reads the output of mapper.py from STDIN (standard input), aggregates the occurrences of each word, and writes the final output to STDOUT.
cd Documents/        # change to the Documents directory
touch reducer.py     # create an empty file
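A minimal sketch of reducer.py could look as follows. It assumes that the mapper emits tab-separated word and count lines and that the input arrives sorted by word, which is what Hadoop's shuffle phase (or sort in a local test) provides. The names parse and reduce_lines are hypothetical, and the original article's listing may differ.

```python
#!/usr/bin/env python3
"""reducer.py: sum the counts per word from sorted 'word<TAB>count' lines."""
import sys
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, skipping lines that are not well formed."""
    for line in lines:
        word, sep, count = line.rstrip("\n").rpartition("\t")
        if sep and word and count.isdigit():
            yield word, int(count)


def reduce_lines(lines):
    """Yield 'word<TAB>total' per word; the input must be grouped by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        total = sum(count for _, count in group)
        yield "%s\t%d" % (word, total)


def main():
    for record in reduce_lines(sys.stdin):
        sys.stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

Because groupby only merges consecutive equal keys, this reducer holds a single running total in memory at a time, but it relies on the sorted input order.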
Now let’s check that reducer.py works together with mapper.py using the command below.
cat word_count_data.txt | python mapper.py | sort -k1,1 | python reducer.py
We can see that our reducer is also working fine in our local system.
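The sort -k1,1 stage in the local pipeline stands in for Hadoop's shuffle and sort phase. The sketch below illustrates why it matters, assuming a reducer that sums runs of consecutive equal words (a common design for streaming reducers). The helper names and the one-line input are hypothetical.

```python
from itertools import groupby


def map_words(lines):
    """Map step: one (word, 1) pair per whitespace-separated word."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_pairs(pairs):
    """Reduce step: sum the counts over each run of consecutive equal words."""
    return [(word, sum(count for _, count in run))
            for word, run in groupby(pairs, key=lambda pair: pair[0])]


lines = ["b a b"]  # hypothetical input line
pairs = map_words(lines)

unsorted_result = reduce_pairs(pairs)        # "b" appears as two separate runs
sorted_result = reduce_pairs(sorted(pairs))  # sorting mimics `sort -k1,1`

print(unsorted_result)  # [('b', 1), ('a', 1), ('b', 1)]
print(sorted_result)    # [('a', 1), ('b', 2)]
```

Without the sort, the word b is reported twice with partial counts; with it, all records for a word are adjacent and the total is correct.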
Step 4: Now let’s start all the Hadoop daemons, for example with the start-all.sh script (or start-dfs.sh followed by start-yarn.sh).
Now create a directory named word_count_in_python in the HDFS root directory to store our word_count_data.txt file, using the command below.
hdfs dfs -mkdir /word_count_in_python
Copy word_count_data.txt to this folder in HDFS with the copyFromLocal command.
The syntax to copy files from your local file system to HDFS is given below:
hdfs dfs -copyFromLocal /path_1 /path_2 ... /path_n /destination
Actual command (in my case):
hdfs dfs -copyFromLocal /home/dikshant/Documents/word_count_data.txt /word_count_in_python
Our data file has now been copied to HDFS. We can verify this with the commands below or by browsing HDFS manually.
hdfs dfs -ls /                        # list the content of the root directory
hdfs dfs -ls /word_count_in_python    # list the content of the /word_count_in_python directory
Let’s give executable permission to mapper.py and reducer.py with the command below.
cd Documents/
chmod 777 mapper.py reducer.py   # grant read, write and execute permission to user, group and others
Listing the files afterwards (for example with ls -l) shows that the file permissions have changed.
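Mode 777 also lets every user modify the scripts. Only the execute bit is needed to run a script directly, so a narrower alternative (not what the command above does) is to add execute permission and leave the other bits alone. The sketch below shows this with Python's standard library on a throwaway file, assuming a POSIX system; make_executable is a hypothetical helper name.

```python
import os
import stat
import tempfile


def make_executable(path):
    """Add execute permission for user, group and others; keep other bits as they are."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a temporary file rather than on the real scripts.
with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as handle:
    demo_path = handle.name
os.chmod(demo_path, 0o644)           # start from rw-r--r--
new_mode = make_executable(demo_path)
print(oct(new_mode))                 # 0o755 on a POSIX system
os.remove(demo_path)
```

This has the same effect as chmod a+x on the command line.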
Step 5: Now download the hadoop-streaming jar file (this walkthrough uses hadoop-streaming-2.7.3.jar) and place it somewhere you can easily access it. In my case, I am placing it in the Documents folder, where mapper.py and reducer.py are also located.
Now let’s run our python files with the help of the Hadoop streaming utility as shown below.
hadoop jar /home/dikshant/Documents/hadoop-streaming-2.7.3.jar \
    -input /word_count_in_python/word_count_data.txt \
    -output /word_count_in_python/output \
    -mapper /home/dikshant/Documents/mapper.py \
    -reducer /home/dikshant/Documents/reducer.py
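When the job is launched from a script, it can help to assemble the same invocation as an argument list. The following is a minimal sketch that only builds and prints the command; build_streaming_command is a hypothetical helper, and the paths are the ones used in this walkthrough.

```python
import shlex


def build_streaming_command(jar, input_path, output_path, mapper, reducer):
    """Return the `hadoop jar ...` streaming invocation as an argument list."""
    return [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


command = build_streaming_command(
    jar="/home/dikshant/Documents/hadoop-streaming-2.7.3.jar",
    input_path="/word_count_in_python/word_count_data.txt",
    output_path="/word_count_in_python/output",
    mapper="/home/dikshant/Documents/mapper.py",
    reducer="/home/dikshant/Documents/reducer.py",
)

# Print a copy-pasteable shell line; pass `command` to subprocess.run to execute it.
print(" ".join(shlex.quote(part) for part in command))
```

Two practical points: the -output directory must not already exist when the job starts, and on a multi-node cluster the scripts usually have to be shipped to the worker nodes as well (for example with the -files generic option), which is left out here to mirror the command above.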
In the above command, -output specifies the location in HDFS where the job's output will be stored. In my case, the result is in the file /word_count_in_python/output/part-00000. We can check it by browsing to that location in HDFS or with the cat command shown below.
hdfs dfs -cat /word_count_in_python/output/part-00000
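The output file contains one word and its count per line, so it is easy to post-process. The sketch below assumes tab-separated lines and uses made-up sample lines rather than real job output; parse_counts and top_words are hypothetical helper names.

```python
def parse_counts(lines):
    """Turn 'word<TAB>count' lines into a {word: count} dictionary."""
    counts = {}
    for line in lines:
        word, sep, count = line.rstrip("\n").rpartition("\t")
        if sep and count.strip().isdigit():
            counts[word] = int(count)
    return counts


def top_words(counts, n):
    """Return the n most frequent words, breaking ties alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical lines in the shape a word-count job writes to part-00000.
sample = ["hadoop\t2\n", "python\t3\n", "word\t1\n"]
print(top_words(parse_counts(sample), 2))  # [('python', 3), ('hadoop', 2)]
```

To apply this to the real result, the lines could come from the saved output of the hdfs dfs -cat command above.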
Basic options that we can use with Hadoop Streaming
|Option|Description|
|---|---|
|-mapper|The command to be run as the mapper|
|-reducer|The command to be run as the reducer|
|-input|The DFS input path for the Map step|
|-output|The DFS output directory for the Reduce step|