Apache Spark is an open-source framework for big data processing and analytics on a distributed computing cluster. It combines SQL, batch, streaming, and analytics under one umbrella to quickly develop a variety of big data applications.
Apache Hadoop is also an open-source framework for distributed storage and processing of big data in a cluster environment. The core of Apache Hadoop consists of a storage part HDFS (Hadoop Distributed File System) and a processing part (MapReduce). Apache Hadoop has been traditionally used for big data MapReduce jobs.
While both Spark and Hadoop are fault-tolerant, one of the basic and yet a major difference between the two big data computing frameworks is the way they do data processing and computing. While Hadoop stores data in disk, Spark stores data in-memory. Due to this in-memory processing, Spark is 10-100 X faster than Hadoop.
So does this mean Spark is going to replace Hadoop in future?
Since Hadoopís MapReduce is based on disk based computing, it is more suitable for single pass computations. It is not at all suitable for iterative computations, where output of one algorithm needs to be passed to another algorithm.
To run iterative jobs, we need a sequence of MapReduce jobs where the output of one step needs to be stored in HDFS before it can be passed to the next step. So the next step cannot be invoked until the previous step has completed.
Apart from the above limitations, Hadoop needs integration with several other frameworks/tools to solve big data uses cases (like Apache Storm for stream data processing, Apache Mahout for machine learning).
One of the major advantages of using Apache Spark is speed. Applications can run 100X faster in memory and 10X faster while running on disk . Unlike Hadoop, it stores the intermediate results in memory (instead of disk). This significantly reduces number of disk I/Oís. The image below shows a glimpse of the logistic regression in Hadoop and Spark and they are starkly differentiated by the running times.
Logistic Regression in Hadoop and Spark
In order to achieve faster in-memory computations, Spark uses the concept of RDD (Resilient Distributed Dataset). Spark partitions the data in multiple datasets which can be distributed across clusters and can be quickly re-constructed in case of any node failures.
From the Spark academic paper: "RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition." Contrary to this, Hadoop replicates data across nodes of the cluster for redundancy which involves lot of network I/O also.
Since Apache Spark is compatible with Hadoop data, it can run in Hadoop clusters through YARN apart from Spark's in-built standalone mode. Apart from processing data from local file system, Apache Spark can process data from HBase, Cassandra, Hive (and also any Hadoop InputFormat).
This makes it easy to migrate existing Hadoop applications to Spark.
Apart from this, Spark also offers the following features:
Due to in-memory computations, Spark uses more RAM instead of disk I/O. So Spark deployments need high-end physical machines. So for the data which does not fit into memory, we still need disk-based computations. In case RAM available is less than the data to be processed, Spark does provide disk-based data handling.
Spark provides advanced, real-time analytics framework where MapReduce lacks. To the contrary, MapReduce is well suited for large data processing where quick processing is not a key parameter. As the volume of data grows day by day and coming with high velocity, MapReduce falls short when it comes to low latency analytics.
So, it is clear that Sparkís performance is better when the entire data to be processed fits well in the memory. Hadoopís MapReduce is designed for data that doesnít fit in the memory and can run alongside other services.
Considering all these advantages and constraints of both Spark and MapReduce, it seems both the worlds are here to co-exist and complement each other, rather than Spark replacing Hadoopís MapReduce.
Spark is more towards high-performance plain data processing, where it can do both real-time and batch-oriented processing. Hadoop MapReduce is more inclined towards pure batch processing. For real-time processing, Apache Spark is required whereas machine learning (filtering, clustering, classification, etc.) can be achieved via Apache Mahout (Spark has inbuilt MLib).
Being a single execution engine, even though Spark has an upper hand, it cannot be used on its own. It would still need HDFS for data storage. Choosing between them depends on the variables that we are going to use and the problem we need to solve. We can always leverage the best of both the worlds.
Now since Apache Spark is seen as a complement to the Hadoop ecosystem, there could be several use cases where Hadoopís massive semi-structured data storage in HDFS can be combined with Apache Sparkís faster execution.
One such scenario is ìLambda Architectureî implementation of big data analytics systems where a massive distributed storage is required for batch processing of historical data (considering historical data is growing many folds over the time). In such implementations, real time stream processing and machine learning is required to provide an actionable in-sight in near real time 
These cookies are necessary for the website to function and cannot be switched off.
These cookies allow us to monitor traffic to our website so we can improve the performance and content of our site. They help us to know which pages are the most and least popular and see how visitors move around the site. All information these cookies collect is aggregated and therefore anonymous. If you do not allow these cookies we will not know when you have visited or how you navigated around our website.
These cookies enable the website to provide enhanced functionality and content. They may be set by the website or by third party providers whose services we have added to our pages. If you do not allow these cookies then some or all of these services may not function properly.
These cookies may be set through our site by our advertising partners. They may be used by those companies to build a profile of your interests and show you relevant adverts on other sites. They do not store directly personal information, but are based on uniquely identifying your browser and internet device. If you do not allow these cookies, you will experience less targeted advertising.
Do you have an upcoming project and wantus
to help speed up your time to market?