Difference Between Hadoop and Spark

Last Updated : 20 Sep, 2022

What is Hadoop?

Apache Hadoop is a platform that got its start as a Yahoo project in 2006 and later became a top-level Apache open-source project. It is a framework for handling large datasets in a distributed fashion. The Hadoop ecosystem is highly fault-tolerant and does not depend on hardware to achieve high availability: the framework is designed on the assumption that failures are common and should be detected and handled at the application layer. It is a general-purpose form of distributed processing with several components:

  • Hadoop Distributed File System (HDFS): This stores files in a Hadoop-native format and distributes them across a cluster, managing the storage of large datasets across the Hadoop cluster. Hadoop can handle both structured and unstructured data.
  • YARN: Yet Another Resource Negotiator, a scheduler that coordinates application runtimes.
  • MapReduce: The programming model that actually processes the data in parallel and combines the pieces into the desired result.
  • Hadoop Common: Also known as Hadoop Core, it provides the set of common libraries and utilities that all the other modules depend on.
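The MapReduce model listed above can be sketched in miniature. Below is a single-process Python illustration of the map, shuffle, and reduce phases for a word count; the function names are hypothetical, and real Hadoop jobs are typically written as Java Mapper/Reducer classes and run distributed across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine each key's values into the final result.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data", "big cluster"]
result = reduce_phase(shuffle_phase(map_phase(lines)))
print(result)  # {'big': 2, 'data': 1, 'cluster': 1}
```

In a real cluster, the map and reduce phases run on many nodes in parallel, and the shuffle moves intermediate data over the network.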

Hadoop is built in Java and is accessible from many programming languages for writing MapReduce code, including Python through a Thrift client. It is available either open-source through the Apache distribution, or through vendors such as Cloudera (the largest Hadoop vendor by size and scope), MapR, or Hortonworks.

What is Spark?

Apache Spark is an open-source tool. It is a newer project, initially developed in 2009 at the AMPLab at UC Berkeley. Like Hadoop, it is focused on processing data in parallel across a cluster, but the biggest difference is that it works in memory: it is designed to use RAM for caching and processing data. Spark performs different types of big data workloads, such as:

  • Batch processing.
  • Real-time stream processing. 
  • Machine learning.
  • Graph computation.
  • Interactive queries. 

There are five main components of Apache Spark:

  • Apache Spark Core: Responsible for functions such as scheduling, input and output operations, and task dispatching.
  • Spark SQL: Lets structured data be queried with SQL and provides the DataFrame abstraction for working with it.
  • Spark Streaming: Enables the processing of live data streams.
  • Machine Learning Library (MLlib): A scalable library that makes machine learning more accessible.
  • GraphX: A set of APIs that facilitate graph-analytics tasks.
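The key idea behind Spark Core, the RDD, can be sketched in plain Python. The toy class below is an illustrative assumption, not Spark itself; it only mimics how Spark records transformations lazily and evaluates them when an action such as `collect()` is called:

```python
class ToyRDD:
    # A toy, single-machine sketch of Spark's RDD idea: transformations
    # are recorded lazily; computation happens only on an action.
    def __init__(self, data, transforms=None):
        self._data = data
        self._transforms = transforms or []

    def map(self, fn):
        # Lazily record the transformation; nothing runs yet.
        return ToyRDD(self._data, self._transforms + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._transforms + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded lineage over the base data.
        items = list(self._data)
        for kind, fn in self._transforms:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

In real Spark the data is partitioned across many machines and the recorded lineage is a DAG, but the lazy-transformation/eager-action split works the same way.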

Hadoop vs Spark

This section lists the differences between Hadoop and Spark, on the basis of parameters such as performance, cost, machine learning support, etc.

  • Hadoop reads and writes files to HDFS; Spark processes data in RAM using a concept known as an RDD (Resilient Distributed Dataset).
  • Spark can run either in standalone mode, with a Hadoop cluster serving as the data source, or in conjunction with Mesos. In the latter scenario, the Mesos master replaces the Spark master or YARN for scheduling purposes.
  • Spark is structured around Spark Core, the engine that drives scheduling, optimizations, and the RDD abstraction, and that connects Spark to the correct filesystem or store (HDFS, S3, an RDBMS, or Elasticsearch). Several libraries operate on top of Spark Core: Spark SQL, which lets you run SQL-like commands on distributed datasets; MLlib for machine learning; GraphX for graph problems; and Spark Streaming, which allows the input of continually streaming data such as logs.
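The RDD abstraction is also what gives Spark its fault tolerance: a lost partition can be rebuilt by replaying its recorded lineage over the source data. The following is an illustrative single-machine sketch of that idea (the partition layout and `compute` helper are assumptions for the example, not Spark code):

```python
# Two input partitions plus a recorded lineage of transformations.
source = [list(range(0, 5)), list(range(5, 10))]
lineage = [lambda xs: [x * 2 for x in xs],        # map: double each value
           lambda xs: [x for x in xs if x > 6]]   # filter: keep values > 6

def compute(partition_id):
    # Deterministically rebuild a partition by replaying the lineage
    # over its slice of the source data.
    data = source[partition_id]
    for step in lineage:
        data = step(data)
    return data

partitions = [compute(0), compute(1)]
partitions[1] = None        # simulate losing a cached partition
partitions[1] = compute(1)  # recover it by replaying the lineage
print(partitions)  # [[8], [10, 12, 14, 16, 18]]
```

Because the transformations are deterministic, no replica of the intermediate data is needed; this is the DAG-based recovery referenced in the fault-tolerance comparison below, in contrast to HDFS's block replication.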

Below is a table of differences between Hadoop and Spark:

| Basis | Hadoop | Spark |
|---|---|---|
| Processing Speed | Hadoop's MapReduce model reads and writes from disk, which slows down the processing speed. | Spark stores intermediate data in memory, reducing the number of read/write cycles to disk, hence faster processing. |
| Usage | Designed to handle batch processing efficiently. | Designed to handle real-time data efficiently. |
| Latency | A high-latency computing framework with no interactive mode. | A low-latency framework that can process data interactively. |
| Data | With Hadoop MapReduce, a developer can process data in batch mode only. | Spark can process real-time data from sources such as Twitter and Facebook. |
| Cost | The cheaper option in terms of cost. | Requires a lot of RAM to run in-memory, which increases cluster size and hence cost. |
| Graph Processing | No built-in graph library; algorithms such as PageRank must be written as MapReduce jobs. | Ships with GraphX, a graph-computation library. |
| Fault Tolerance | Highly fault-tolerant: data is replicated across nodes, and the replicas are used in case of failure. | Uses the lineage DAG to rebuild lost data across nodes. |
| Security | Supports Kerberos authentication, LDAP, ACLs, etc., and is therefore more secure. | Not secure by default; it relies on integration with Hadoop to achieve the necessary security level. |
| Machine Learning | Data fragments can be too large and create bottlenecks, so it is slower than Spark. | Much faster, as it uses MLlib for computations and processes in memory. |
| Performance | Slower, as it uses disk for storage and depends on disk read and write operations. | Faster, with reduced disk reading and writing operations. |
| Scalability | Easily scalable by adding nodes and disks for storage; supports tens of thousands of nodes. | Harder to scale, because it relies on RAM for computation; supports thousands of nodes in a cluster. |
| Language Support | Java or Python for MapReduce apps. | Java, R, Scala, Python, or Spark SQL for the APIs. |
| User-friendliness | More difficult to use. | More user-friendly. |
| Resource Management | YARN is the most common option for resource management. | Has built-in tools for resource management. |
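The processing-speed and performance rows can be made concrete with a toy sketch: a Hadoop-style pipeline persists intermediate results to disk between stages, while a Spark-style one keeps them in memory. This is a single-machine illustration of the access pattern only, not a benchmark of either system, and the helper names are hypothetical:

```python
import json
import os
import tempfile

def iterate_via_disk(data, steps):
    # Hadoop-style: write each stage's output to disk, then re-read it
    # as the next stage's input.
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    for _ in range(steps):
        data = [x + 1 for x in data]
        with open(path, "w") as f:   # persist the intermediate result
            json.dump(data, f)
        with open(path) as f:        # re-read it for the next stage
            data = json.load(f)
    return data

def iterate_in_memory(data, steps):
    # Spark-style: the intermediate result simply stays in RAM.
    for _ in range(steps):
        data = [x + 1 for x in data]
    return data

# Both produce the same answer; the disk version pays serialization
# and I/O costs on every iteration.
assert iterate_via_disk([0, 1], 3) == iterate_in_memory([0, 1], 3) == [3, 4]
```

For iterative workloads such as machine learning, this per-iteration disk round trip is the main reason MapReduce falls behind in-memory processing.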