Data Processing
Hadoop: designed mainly for batch processing, at which it is very efficient on large datasets.
Spark: supports batch processing as well as stream processing.
Flink: supports both batch and stream processing, and provides a single runtime for both.
Stream Engine
Hadoop: takes the complete dataset as input at once and produces the output.
Spark: processes data streams in micro-batches.
Flink: a true streaming engine that uses streams for every workload: streaming, micro-batch, SQL, and batch.
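The difference between micro-batching (Spark) and per-record streaming (Flink) can be sketched in plain Python. This is a conceptual toy, not either framework's API; all function names here are illustrative.

```python
# Illustrative sketch: micro-batching buffers records into small batches
# before processing, while a true streaming engine applies the operator
# to each record as it arrives. Not Spark/Flink API code.

def micro_batch(stream, batch_size, fn):
    """Spark-style: group records into batches, then process each batch."""
    out, batch = [], []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            out.extend(fn(batch))
            batch = []
    if batch:                       # flush the final partial batch
        out.extend(fn(batch))
    return out

def per_record(stream, fn):
    """Flink-style: apply the operator to every record individually."""
    return [fn([r])[0] for r in stream]

double = lambda batch: [x * 2 for x in batch]
```

Both produce the same results here; the practical difference is latency, since a micro-batch system must wait for a batch to fill before emitting anything.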
Data Flow
Hadoop: data flow is linear and contains no loops.
Spark: represents data flow as a directed acyclic graph (DAG).
Flink: uses a controlled cyclic dependency graph at runtime, which supports ML algorithms efficiently.
Computation Model
Hadoop: MapReduce supports a batch-oriented model.
Spark: supports a micro-batching computational model.
Flink: supports a continuous operator-based streaming model.
Performance
Hadoop: slower than Spark and Flink.
Spark: faster than Hadoop, slower than Flink.
Flink: the highest performance of the three.
Memory Management
Hadoop: configurable; memory can be managed dynamically or statically.
Spark: the latest releases provide automatic memory management.
Flink: supports automatic memory management.
Fault Tolerance
Hadoop: highly fault-tolerant through a replication mechanism.
Spark: RDDs provide fault tolerance through lineage.
Flink: fault tolerance is based on Chandy-Lamport distributed snapshots, which allows high throughput.
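Spark's lineage-based recovery contrasts with Hadoop's replication: instead of storing extra copies of data, a lost result is recomputed by replaying the recorded chain of transformations from the durable source. The sketch below is a toy model of that idea, not Spark's actual RDD implementation.

```python
# Toy model of lineage-based fault tolerance: each transformation is
# recorded rather than eagerly materialized, so any lost result can be
# rebuilt by replaying the lineage from the (assumed durable) source.
# Hypothetical class, not a real Spark API.

class LineageDataset:
    def __init__(self, source, transforms=()):
        self.source = list(source)            # durable input data
        self.transforms = list(transforms)    # recorded lineage

    def map(self, fn):
        # Record the transformation instead of applying it immediately.
        return LineageDataset(self.source, self.transforms + [fn])

    def compute(self):
        data = self.source
        for fn in self.transforms:            # replay the lineage
            data = [fn(x) for x in data]
        return data

    def recover(self):
        # After a failure, the result is rebuilt from lineage alone,
        # with no replicated copy needed.
        return self.compute()
```

The trade-off: replication recovers instantly but costs storage, while lineage costs nothing until a failure actually happens, at which point recomputation takes time.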
Scalability
Hadoop: highly scalable; clusters can grow to tens of thousands of nodes.
Spark: highly scalable.
Flink: also highly scalable.
Iterative Processing
Hadoop: does not support iterative processing.
Spark: supports iterative processing.
Flink: supports iterative processing and iterates over data natively through its streaming architecture.
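Iterative processing means feeding an operator's output back in as its next input until some condition is met, which is what ML algorithms typically need. A minimal sketch of that loop, with Newton's method for a square root as the example step (the `iterate` helper is illustrative, not a Flink or Spark API):

```python
# Sketch of an iteration construct: repeatedly apply `step`, feeding the
# output back as input, until `converged` says to stop or a cap is hit.
# Hypothetical helper, not a real framework API.

def iterate(initial, step, max_iters, converged):
    state = initial
    for _ in range(max_iters):
        nxt = step(state)
        if converged(state, nxt):
            return nxt
        state = nxt
    return state

# Example step: one Newton iteration for sqrt(2).
newton_sqrt2 = lambda x: (x + 2 / x) / 2
```

In MapReduce, each such round-trip is a separate job with disk I/O in between, which is why iterative workloads are where Hadoop falls behind.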
Supported Languages
Hadoop: Java, C, C++, Python, Perl, Groovy, Ruby, etc.
Spark: Java, Python, R, Scala.
Flink: Java, Python, R, Scala.
Cost
Hadoop: runs on commodity hardware, which is less expensive.
Spark: needs a lot of RAM, so the cost is relatively high.
Flink: also needs a lot of RAM, so the cost is relatively high.
Abstraction
Hadoop: no abstraction in MapReduce.
Spark: RDD abstraction.
Flink: Dataset abstraction for batch and DataStream abstraction for streaming.
SQL Support
Hadoop: users can run SQL queries using Apache Hive.
Spark: users can run SQL queries using Spark SQL; it also supports Hive.
Flink: provides a Table API with SQL-like expressions; the Apache foundation is planning to add a SQL interface in a future release.
Caching
Hadoop: MapReduce cannot cache data.
Spark: can cache data in memory.
Flink: can also cache data in memory.
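Why caching matters: MapReduce rereads input from disk between jobs, whereas a cached dataset is computed once and then served from memory on every reuse. A toy illustration of the pattern (hypothetical class, not the Spark `cache()`/`persist()` API):

```python
# Toy model of in-memory dataset caching: the expensive computation runs
# once on first access and subsequent accesses reuse the stored result.
# Illustrative only, not a real framework API.

class CachedDataset:
    def __init__(self, compute_fn):
        self.compute_fn = compute_fn
        self._cache = None
        self.computations = 0          # how many times the work actually ran

    def collect(self):
        if self._cache is None:        # first access: do the work, keep it
            self.computations += 1
            self._cache = self.compute_fn()
        return self._cache             # later accesses: serve from memory
```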
Hardware Requirements
Hadoop: runs well on less expensive commodity hardware.
Spark: needs higher-end hardware.
Flink: also needs higher-end hardware.
Machine Learning
Hadoop: Apache Mahout is used for ML.
Spark: powerful for implementing ML algorithms with its own ML library (MLlib).
Flink: the FlinkML library is used for ML implementations.
Lines of Code
Hadoop: Hadoop 2.0 has about 120,000 lines of code.
Spark: developed in about 20,000 lines of code.
Flink: developed in Scala and Java, so its line count is lower than Hadoop's.
High Availability
Hadoop: configurable in high-availability mode.
Spark: configurable in high-availability mode.
Flink: configurable in high-availability mode.
Amazon S3 Connector
Hadoop: provides support for an Amazon S3 connector.
Spark: provides support for an Amazon S3 connector.
Flink: provides support for an Amazon S3 connector.
Backpressure Handling
Hadoop: handles backpressure through manual configuration.
Spark: also handles backpressure through manual configuration.
Flink: handles backpressure implicitly through its system architecture.
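Implicit backpressure usually comes down to bounded buffers between operators: when the downstream consumer lags, the buffer fills and the upstream producer is forced to slow down with no rate limit configured by hand. A toy (non-threaded) sketch of that mechanism; the class and its methods are illustrative, not Flink's actual network stack:

```python
# Toy bounded buffer between a producer and a consumer. A full buffer
# rejects (in a real system: blocks) the producer, which is the essence
# of architecture-level backpressure. Illustrative only.

from collections import deque

class BoundedBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.producer_blocked = 0      # times the producer had to wait

    def offer(self, item):
        if len(self.items) >= self.capacity:
            self.producer_blocked += 1 # real systems block or use credit-based flow control
            return False
        self.items.append(item)
        return True

    def poll(self):
        # Consuming an item frees capacity, letting the producer resume.
        return self.items.popleft() if self.items else None
```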
Criteria for Windows
Hadoop: has no window criteria, since it does not support streaming.
Spark: has time-based window criteria.
Flink: has record-based window criteria.
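The two window styles differ in what closes a window: a record count versus a span of time. A plain-Python sketch of both (hypothetical helpers, not the Spark or Flink windowing APIs):

```python
# Toy illustrations of the two windowing criteria. Not framework APIs.

def count_windows(records, size):
    """Record-based windows: every `size` records form one window."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def time_windows(timed_records, span):
    """Time-based windows over (timestamp, value) pairs:
    values are grouped into span-sized time buckets."""
    windows = {}
    for ts, value in timed_records:
        windows.setdefault(ts // span, []).append(value)
    return [windows[k] for k in sorted(windows)]
```

Note that a time window can hold any number of records (including zero), while a count window can stay open for an unbounded amount of time, which is why the choice matters for latency guarantees.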
Apache License
Hadoop: Apache License 2.0.
Spark: Apache License 2.0.
Flink: Apache License 2.0.