Open In App

Difference Between Hadoop and Apache Spark

Improve
Improve
Improve
Like Article
Like
Save Article
Save
Share
Report issue
Report

Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. 

Hadoop is built in Java, and accessible through many programming languages, for writing MapReduce code, including Python, through a Thrift client. It’s available either open-source through the Apache distribution, or through vendors such as Cloudera (the largest Hadoop vendor by size and scope), MapR, or HortonWorks. 

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. 

Spark is structured around Spark Core, the engine that drives the scheduling, optimizations, and RDD abstraction, as well as connects Spark to the correct filesystem (HDFS, S3, RDBMS, or Elasticsearch). There are several libraries that operate on top of Spark Core, including Spark SQL, which allows you to run SQL-like commands on distributed data sets, MLLib for machine learning, GraphX for graph problems, and streaming which allows for the input of continually streaming log data. 
 

Hadoop-vs-Apache-Spark

Hadoop vs Apache Spark

Features Hadoop Apache Spark
Data Processing Apache Hadoop provides batch processing Apache Spark provides both batch processing and stream processing
Memory usage Hadoop is disk-bound  Spark uses large amounts of RAM
Security Better security features Its security is currently in its infancy
Fault Tolerance Replication is used for fault tolerance. RDD and various data storage models are used for fault tolerance.
Graph Processing Algorithms like PageRank is used. Spark comes with a graph computation library called GraphX.
Ease of Use Difficult to use. Easier to use.
Real-time data processing It fails when it comes to real-time data processing. It can process real-time data.
Speed Hadoop’s MapReduce model reads and writes from a disk, thus it slows down the processing speed. Spark reduces the number of read/write cycles to disk and store intermediate data in memory, hence faster-processing speed.
Latency It is high latency computing framework. It is a low latency computing and can process data interactively

Last Updated : 30 Sep, 2022
Like Article
Save Article
Previous
Next
Share your thoughts in the comments
Similar Reads