Hadoop: Hadoop got its start as a Yahoo project in 2006 and later became a top-level Apache open-source project. It’s a general-purpose framework for distributed processing with three core components: the Hadoop Distributed File System (HDFS), which stores files in a Hadoop-native format and parallelizes them across a cluster; YARN, a scheduler that coordinates application runtimes; and MapReduce, the algorithm that actually processes the data in parallel. Hadoop is built in Java and accessible from many programming languages for writing MapReduce code, including Python, through a Thrift client.
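To make the MapReduce model concrete, here is a minimal single-process sketch in Python of the classic word-count job. This is not Hadoop itself; the function names are illustrative, and the shuffle step that Hadoop performs across the cluster is simulated with an in-memory dictionary.

```python
from collections import defaultdict

# Toy sketch of the MapReduce model (not Hadoop itself):
# the map phase emits (key, value) pairs, a shuffle groups them by key,
# and the reduce phase aggregates each group.

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"], counts["fox"])  # 3 2
```

In a real Hadoop job, the map and reduce functions run on different nodes and the intermediate pairs are written to and read from disk between the two phases, which is the read/write overhead Spark avoids.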
It’s available either open source through the Apache distribution, or through vendors such as Cloudera (the largest Hadoop vendor by size and scope), MapR, or Hortonworks.
Spark: Spark is a newer project, initially developed in 2009 at the AMPLab at UC Berkeley. It’s also a top-level Apache project focused on processing data in parallel across a cluster, but the biggest difference is that it works in memory.
Whereas Hadoop reads and writes files to HDFS, Spark processes data in RAM using a concept known as an RDD (Resilient Distributed Dataset). Spark can run either in standalone mode, with a Hadoop cluster serving as the data source, or in conjunction with Mesos. In the latter scenario, the Mesos master replaces the Spark master or YARN for scheduling purposes.
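The key properties of an RDD are that transformations are lazy (they record a lineage of operations rather than executing immediately) and that the data stays in memory until an action forces computation. A toy single-process sketch of that idea, with method names that mirror Spark's API but are not Spark itself:

```python
# Toy sketch of the RDD idea (not Spark itself): transformations such as
# map and filter are recorded lazily as a lineage, and nothing executes
# until an action like collect() is called.

class ToyRDD:
    def __init__(self, data, lineage=None):
        self._data = data              # held in memory
        self._lineage = lineage or []  # recorded, not yet executed

    def map(self, fn):
        return ToyRDD(self._data, self._lineage + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._lineage + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded lineage over the in-memory data.
        result = list(self._data)
        for op, fn in self._lineage:
            if op == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark the lineage also serves as the "resilient" part: if a partition is lost, Spark recomputes just that partition by replaying the lineage, instead of checkpointing everything to disk.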
Spark is structured around Spark Core, the engine that drives scheduling, optimizations, and the RDD abstraction, and that connects Spark to the correct filesystem or data store (HDFS, S3, an RDBMS, or Elasticsearch). Several libraries operate on top of Spark Core: Spark SQL, which lets you run SQL-like commands on distributed data sets; MLlib for machine learning; GraphX for graph problems; and Spark Streaming, which allows for the input of continually streaming log data.
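To illustrate what a Spark SQL query computes, here is a toy pure-Python analogue of a `SELECT dept, AVG(salary) FROM employees GROUP BY dept` query. The sample rows are made up for illustration; Spark SQL would evaluate the equivalent query across a distributed data set rather than a local list.

```python
from collections import defaultdict

# Single-process analogue of a SQL-style GROUP BY + AVG, the kind of
# query Spark SQL runs over a distributed data set. Sample rows are
# invented for illustration.

employees = [
    {"dept": "eng", "salary": 100},
    {"dept": "eng", "salary": 120},
    {"dept": "sales", "salary": 80},
]

salaries_by_dept = defaultdict(list)
for row in employees:
    salaries_by_dept[row["dept"]].append(row["salary"])

avg_salary = {dept: sum(s) / len(s) for dept, s in salaries_by_dept.items()}
print(avg_salary)  # {'eng': 110.0, 'sales': 80.0}
```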
Below is a table of differences between Spark and Hadoop:
| # | Hadoop | Spark |
|---|--------|-------|
| 1 | An open-source framework that uses the MapReduce algorithm. | A lightning-fast cluster-computing technology that extends the MapReduce model to efficiently support more types of computation. |
| 2 | MapReduce reads and writes from disk, which slows down processing. | Reduces the number of read/write cycles to disk by storing intermediate data in memory, giving faster processing. |
| 3 | Designed to handle batch processing efficiently. | Designed to handle real-time data efficiently. |
| 4 | A high-latency computing framework with no interactive mode. | A low-latency computing framework that can process data interactively. |
| 5 | With MapReduce, a developer can process data in batch mode only. | Can process real-time data from sources such as Twitter and Facebook. |
| 6 | The cheaper option in terms of cost. | Requires a lot of RAM to run in memory, increasing cluster size and therefore cost. |