Introduction to Hadoop
What is Hadoop?
Hadoop is an open-source software framework for storing large amounts of data and running computations on that data across clusters of machines. It is written mainly in Java, with some native code in C and command-line utilities written as shell scripts.
History of Hadoop
Hadoop is developed under the Apache Software Foundation, and its co-founders are Doug Cutting and Mike Cafarella.
Doug Cutting named it after his son’s toy elephant. In October 2003, Google published its Google File System paper, which influenced the design. In January 2006, the MapReduce and distributed file system code was moved out of Apache Nutch into the new Hadoop project; that initial code consisted of around 6,000 lines for MapReduce and around 5,000 lines for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop Distributed File System
Hadoop has a distributed file system known as HDFS. HDFS splits files into blocks and distributes them, with replicas, across the nodes of a cluster, which can be large. If a node fails, the system keeps operating, because HDFS handles the transfer of data between the remaining nodes.
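The block splitting and node-failure behaviour described above can be illustrated with a small toy model. This is only a sketch, not the HDFS API: the function names, the node names, and the round-robin placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy). The 128 MB block size and replication factor of 3 match the usual defaults in recent Hadoop versions.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full, rest = divmod(file_size, block_size)
    # The last block only holds the leftover bytes; it is not padded.
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (simple round-robin,
    a hypothetical stand-in for the real placement policy)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def surviving_replicas(placement, failed_node):
    """Return the replica locations that remain after one node fails."""
    return {
        block: [node for node in replicas if node != failed_node]
        for block, replicas in placement.items()
    }

# A 300 MB file with 128 MB blocks becomes two full blocks and one 44 MB block.
blocks = split_into_blocks(300 * MB, 128 * MB)
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
placement = place_replicas(len(blocks), nodes, replication=3)
after_failure = surviving_replicas(placement, "node2")
```

In this model every block still has at least two replicas after `node2` fails, which is why the data stays readable while the missing copies are re-created elsewhere.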
Advantages of HDFS:
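HDFS is inexpensive to run, stores data reliably, tolerates faults, is scalable and block-structured, and can process a large amount of data simultaneously, among other benefits. Its data is also largely immutable: files follow a write-once, read-many model and are generally not updated in place.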
Disadvantages of HDFS:
Its biggest disadvantage is that it is a poor fit for small amounts of data. It can also have stability issues, and it is restrictive and rough to work with.
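One way to see why small data is a poor fit: every file and every block is an entry in the metadata that HDFS keeps in the NameNode's memory, so the same volume of data costs far more bookkeeping when it is spread over many small files. The sketch below only counts those entries; the function name is hypothetical and the model ignores directories and replicas.

```python
MB = 1024 * 1024

def namespace_objects(file_sizes, block_size):
    """Count file entries plus block entries for a list of file sizes (bytes)."""
    files = len(file_sizes)
    # Ceiling division: a file occupies one block per started block_size chunk.
    blocks = sum(-(-size // block_size) for size in file_sizes)
    return files + blocks

block_size = 128 * MB

# 1 GB stored as a single file: 1 file entry + 8 block entries.
one_large_file = namespace_objects([1024 * MB], block_size)

# The same 1 GB stored as 1,024 files of 1 MB each: 1,024 files + 1,024 blocks.
many_small_files = namespace_objects([1 * MB] * 1024, block_size)
```

Under these assumptions the small-file layout needs 2,048 entries against 9 for the single large file, for the same amount of stored data.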
Hadoop also works with a wide range of software packages, such as Apache Flume, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
Some common frameworks used with Hadoop
- Hive- It uses HiveQL, a SQL-like language, to structure and query data stored in HDFS. Queries are translated into jobs such as MapReduce, so complicated MapReduce code does not have to be written by hand.
- Drill- It supports user-defined functions and is used for data exploration.
- Storm- It processes streaming data in real time.
- Spark- It is widely used for data processing and includes a machine learning library (MLlib). It supports Java, Python, and Scala.
- Pig- It provides Pig Latin, a high-level data-flow language with SQL-like operations, and is used to transform data, including unstructured data.
- Tez- It reduces the complexity of running Hive and Pig workloads and helps their jobs run faster.
The Hadoop framework is made up of the following modules:
- Hadoop MapReduce- a programming model for processing large data sets in parallel.
- Hadoop Distributed File System- a file system that distributes files across the nodes of a cluster.
- Hadoop YARN- a platform that manages the cluster's computing resources.
- Hadoop Common- the packages and libraries that the other modules use.
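To make the MapReduce programming model more concrete, here is a minimal single-process sketch of the classic word count in plain Python. It does not use Hadoop's Java API or run on a cluster; the function names are hypothetical and only mirror the three conceptual phases (map, shuffle, reduce).

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Each line is mapped independently of the others, which is what lets
    # a real cluster run the map tasks in parallel on different nodes.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["Hadoop stores data", "Hadoop processes data"])
```

Here `counts` maps `"hadoop"` and `"data"` to 2, and `"stores"` and `"processes"` to 1. In Hadoop itself, the framework would handle the shuffle step and distribute the map and reduce tasks over the cluster.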
Advantages and Disadvantages of Hadoop
Advantages of Hadoop:
- Ability to store a large amount of data.
- High flexibility.
- Cost-effective.
- High computational power.
- Tasks are independent.
- Linear scaling.
Disadvantages of Hadoop:
- Not very effective for small data.
- Cluster management is hard.
- Has stability issues.
- Security concerns.