Skip to content
Related Articles

Related Articles

Improve Article

Difference Between Hadoop and Apache Spark

  • Last Updated : 18 Sep, 2020

Hadoop: It is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. 
Hadoop is built in Java, and accessible through many programming languages, for writing MapReduce code, including Python, through a Thrift client. It’s available either open-source through the Apache distribution, or through vendors such as Cloudera (the largest Hadoop vendor by size and scope), MapR, or HortonWorks. 

Apache Spark: It is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. 
Spark is structured around Spark Core, the engine that drives the scheduling, optimizations, and RDD abstraction, as well as connects Spark to the correct filesystem (HDFS, S3, RDBMS, or Elasticsearch). There are several libraries that operate on top of Spark Core, including Spark SQL, which allows you to run SQL-like commands on distributed data sets, MLLib for machine learning, GraphX for graph problems, and streaming which allows for the input of continually streaming log data. 



Below is a table of differences between Hadoop and Apache Spark: 



FeaturesHadoopApache Spark
Data ProcessingApache Hadoop provides batch processingApache Spark provides both batch processing and stream processing
Memory usageSpark uses large amounts of RAMHadoop is disk-bound
SecurityBetter security featuresIt security is currently in its infancy
Fault ToleranceReplication is used for fault toleranceRDD and various data storage models are used for fault tolereance
Graph ProcessingAlgorithms like PageRank is usedSpark comes with a graph computation library called GraphX
Ease of UseDifficult to useEasier to use
Real-time data processingIt fails when it comes to real-time data processingIt can process real-time data
SpeedHadoop’s MapReduce model reads and writes from a disk, thus slow down the processing speedSpark reduces the number of read/write cycles to disk and store intermediate data in-memory, hence faster-processing speed.
LatencyIt is high latency computing frameworkIt is a low latency computing and can process data interactively


Go Premium (An Ad Free Experience with many more features)

My Personal Notes arrow_drop_up
Recommended Articles
Page :