Most of us are familiar with the disaster that occurred on April 14, 1912. The giant 46,000-ton ship sank to a depth of 13,000 feet in the North Atlantic Ocean. Our aim is to analyze the data collected after this disaster. Hadoop MapReduce can be used to process such a large dataset efficiently and answer a particular question about it.
Problem Statement: Analyze the Titanic disaster dataset to find the average age of the male and female passengers who died in the disaster, using Hadoop MapReduce.
We can download the Titanic dataset from this Link. Below is the column structure of our Titanic dataset. It consists of 12 columns, where each row describes the information of a particular person.
The first 10 records of the dataset are shown below.
Make the project in Eclipse with the steps below:
- First, open Eclipse -> then select File -> New -> Java Project -> name it Titanic_Data_Analysis -> then select Use an execution environment -> choose JavaSE-1.8 -> Next -> Finish.
- In this project, create a Java class with the name Average_age -> then click Finish.
- Copy the source code below into this Average_age Java class.
- Now we need to add external JARs for the packages we have imported. Download the JAR packages Hadoop Common and Hadoop MapReduce Core according to your Hadoop version.
Check your Hadoop version:
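On a working installation, the installed version can be printed with:

```shell
hadoop version
```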
- Now add these external JARs to our Titanic_Data_Analysis project. Right-click on Titanic_Data_Analysis -> select Build Path -> click Configure Build Path, select Add External JARs…, add the JARs from their download location, then click Apply and Close.
- Now export the project as a JAR file. Right-click on Titanic_Data_Analysis, choose Export… and go to Java -> JAR file, click Next and choose your export destination, then click Next. Choose the main class as Average_age by clicking Browse, then click Finish -> OK.
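The core logic of the job — map each record to a (sex, age) pair for passengers who did not survive, then reduce each group to an average — can be sketched in plain Java. This is an illustrative stand-in for the Average_age class, not its actual Hadoop source: column positions assume the standard Titanic layout (Survived in column 1, Sex in column 4, Age in column 5), and the naive comma split ignores quoted commas inside real passenger names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AverageAgeSketch {
    // Map phase: emit (sex, age) for each passenger who died (Survived == 0)
    // and whose Age field is present.
    static void map(String line, Map<String, List<Double>> groups) {
        String[] f = line.split(",");
        if (f.length > 5 && f[1].equals("0") && !f[5].isEmpty()) {
            groups.computeIfAbsent(f[4], k -> new ArrayList<>())
                  .add(Double.parseDouble(f[5]));
        }
    }

    // Reduce phase: average the ages collected for one sex.
    static double reduce(List<Double> ages) {
        double sum = 0;
        for (double a : ages) sum += a;
        return sum / ages.size();
    }

    public static void main(String[] args) {
        // Two toy records in the dataset's column order (names simplified
        // to avoid the quoted commas the real dataset contains).
        String[] records = {
            "1,0,3,Braund,male,22,1,0,A/5 21171,7.25,,S",
            "2,0,1,Cumings,female,38,1,0,PC 17599,71.28,C85,C"
        };
        Map<String, List<Double>> groups = new HashMap<>();
        for (String r : records) map(r, groups);
        for (Map.Entry<String, List<Double>> e : groups.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

In the real job, the map step runs in a Hadoop Mapper emitting (Text, DoubleWritable) pairs, and Hadoop's shuffle groups the values by key before the Reducer computes each average.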
Start Hadoop Daemons
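Assuming a standard setup where Hadoop's sbin scripts are on the PATH, the daemons can be started with:

```shell
start-dfs.sh    # starts NameNode, DataNode, and SecondaryNameNode
start-yarn.sh   # starts ResourceManager and NodeManager
```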
Then, Check Running Hadoop Daemons.
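The jps utility (shipped with the JDK) lists running Java processes; on a healthy single-node setup it should show NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:

```shell
jps
```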
Move your dataset to HDFS.
hdfs dfs -put /file_path /destination
In the command below, / denotes the root directory of our HDFS.
hdfs dfs -put /home/dikshant/Documents/titanic_data.txt /
Check that the file was sent to our HDFS.
hdfs dfs -ls /
Now run your JAR file with the command below, writing the output to the Titanic_Output directory.
hadoop jar /jar_file_location /dataset_location_in_HDFS /output-file_name
hadoop jar /home/dikshant/Documents/Average_age.jar /titanic_data.txt /Titanic_Output
Now go to localhost:50070/, under Utilities select Browse the file system, and download part-r-00000 from the /Titanic_Output directory to see the result.
Note: We can also view the result with the command below.
hdfs dfs -cat /Titanic_Output/part-r-00000
From this output, we can see that, according to our dataset, the average age of the females who died in the Titanic disaster is 28, and that of the males is 30.