Introduction to Hadoop

What is Hadoop?

Hadoop is an open source software programming framework for storing a large amount of data and performing the computation. Its framework is based on Java programming with some native code in C and shell scripts.

History of Hadoop

Apache Software Foundation is the developers of Hadoop, and it’s co-founders are Doug Cutting and Mike Cafarella.
It’s co-founder Doug Cutting named it on his son’s toy elephant. In October 2003 the first paper release was Google File System. In January 2006, MapReduce development started on the Apache Nutch which consisted of around 6000 lines coding for it and around 5000 lines coding for HDFS. In April 2006 Hadoop 0.1.0 was released.

Hadoop Distributed File System

It has distributed file system known as HDFS and this HDFS splits files into blocks and sends them across various nodes in form of large clusters. Also in case of a node failure, the system operates and data transfer takes place between the nodes which are facilitated by HDFS.

HDFS

Advantages of HDFS:
It is inexpensive, immutable in nature, stores data reliably, ability to tolerate faults, scalable, block structured, can process a large amount of data simultaneously and many more.

Disadvantages of HDFS:
It’s the biggest disadvantage is that it is not fit for small quantities of data. Also, it has issues related to potential stability, restrictive and rough in nature.

Hadoop also supports a wide range of software packages such as Apache Flumes, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache Hive, Apache Phoenix, Cloudera Impala.

Some common frameworks of Hadoop

  1. Hive- It uses HiveQl for data structuring and for writing complicated MapReduce in HDFS.
  2. Drill- It consists of user-defined functions and is used for data exploration.
  3. Storm- It allows real-time processing and streaming of data.
  4. Spark- It contains a Machine Learning Library(MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
  5. Pig- It has Pig Latin, a SQL-Like language and performs data transformation of unstructured data.
  6. Tez- It reduces the complexities of Hive and Pig and helps in the running of their codes faster.

Hadoop framework is made up of the following modules:

  1. Hadoop MapReduce- a MapReduce programming model for handling and processing large data.
  2. Hadoop Distributed File System- distributed files in clusters among nodes.
  3. Hadoop YARN- a platform which manages computing resources.
  4. Hadoop Common- it contains packages and libraries which are used for other modules.

Advantages and Disadvantages of Hadoop

Advantages:

  • Ability to store a large amount of data.
  • High flexibility.
  • Cost effective.
  • High computational power.
  • Tasks are independent.
  • Linear scaling.

Disadvantages:

  • Not very effective for small data.
  • Hard cluster management.
  • Has stability issues.
  • Security concerns.


My Personal Notes arrow_drop_up

Check out this Author's contributed articles.

If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.



Improved By : Madhurkant Sharma



Article Tags :
Practice Tags :


Be the First to upvote.


Please write to us at contribute@geeksforgeeks.org to report any issue with the above content.