Hadoop Ecosystem

Some people still think of Hadoop as a single, out-of-the-box system that solves every big data problem. Unless you are thinking of a third-party commercial distribution, this is not correct. In reality, Hadoop on its own is just HDFS and MapReduce. If you want a production-ready Hadoop system, you will also have to consider Hadoop's friends (or components), which together make it a complete big data solution.

Most of the components are Apache projects, but a few are non-Apache open source or, in some cases, even commercial. The ecosystem is continuously evolving, with a large number of open source contributors. The following diagram gives a high-level overview of the Hadoop ecosystem.

Figure 1: Hadoop Ecosystem

The Hadoop ecosystem is logically divided into five layers, which are largely self-explanatory. Some of the ecosystem components are explained below:

Data Storage is where the raw data resides. Hadoop supports multiple file systems, and connectors are also available for data warehouses (DW) and relational databases.
HDFS is the distributed file system that comes out of the box with the Hadoop framework. It uses the TCP/IP layer for communication. An advantage of using HDFS is data awareness between the job tracker and task tracker, which lets tasks be scheduled close to the data they read.
The Amazon S3 file system is targeted at clusters hosted on the Amazon Elastic Compute Cloud (EC2) server-on-demand infrastructure. This file system has no rack awareness, as it is all remote.
MapR’s maprfs provides high availability, transactionally correct snapshots, and higher performance than HDFS. Maprfs is available as part of the MapR distribution.
HBase is a column-oriented, sparse, multidimensional database inspired by Google’s BigTable. HBase provides sorted data access by maintaining partitions, or regions, of data. The underlying storage is HDFS.
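To make the idea of sorted access through regions a little more concrete, here is a minimal sketch in plain Python. It does not use HBase at all; the region boundaries, region names and row keys are made up for illustration. It only shows the general principle: when row keys are kept sorted and each region owns a contiguous key range, both finding the region for a key and scanning a key range reduce to binary searches.

```python
from bisect import bisect_left, bisect_right

# Hypothetical split points: each region starts at its start key and runs
# up to (but not including) the start key of the next region.
REGION_START_KEYS = ["", "g", "n", "t"]  # "" marks the start of the table
REGION_NAMES = ["region-1", "region-2", "region-3", "region-4"]


def find_region(row_key):
    """Return the name of the region whose key range contains row_key."""
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[index]


def scan(sorted_keys, start_key, stop_key):
    """Return all keys k with start_key <= k < stop_key from a sorted list."""
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


# Illustrative row keys, stored in sorted order.
row_keys = sorted(["user#007", "user#001", "order#42", "user#019", "order#07"])

print(find_region("apple"))               # falls in the first region
print(find_region("zebra"))               # falls in the last region
print(scan(row_keys, "user#", "user$"))   # every key with the prefix "user#"
```

The prefix scan works because "$" is the character right after "#" in ASCII, so "user$" is the smallest key that sorts after every key starting with "user#". This is also why row key design matters so much in a sorted store: rows that should be read together need keys that sort together.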

Hive is a data warehouse infrastructure with SQL-like querying capabilities on Hadoop datasets. The SQL interface makes Hive an attractive choice for developers who need to validate data quickly, as well as for product managers and analysts.
Pig is a high-level data flow platform and execution framework for parallel computation. It uses the scripting language Pig Latin. Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter, so you can analyze the data in a Hadoop cluster even if you aren't familiar with Java and MapReduce.
Avro is a data serialization system that provides rich data structures, a container file for storing persistent data, and remote procedure calls (RPC). It uses JSON to define data types and protocols, and serializes data in a compact binary format.
Mahout is a machine learning library whose core algorithms cover (user- and item-based) recommendation or batch-based collaborative filtering, classification, and clustering. The core algorithms are implemented on top of Apache Hadoop using the map/reduce paradigm, though Mahout can also be used outside the Hadoop world as a math library focused on linear algebra and statistics.
Sqoop is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It is a command-line application that supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import the updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase.
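Since Pig and Hive both end up running MapReduce jobs on your behalf, it helps to see what such a job does. The following is only a sketch in plain Python, not Hadoop code: the function names and the sample lines are made up, and everything runs in a single process. It imitates the three steps of a word count, which is roughly the kind of job a simple group-and-count script would be turned into.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: collapse the values of one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Illustrative input; on a real cluster these would be blocks of a file in HDFS.
lines = ["Hadoop stores data", "Pig and Hive query data", "hadoop scales"]
result = word_count(lines)
print(result)
```

On a cluster, the map and reduce steps run in parallel on many nodes and the shuffle moves data across the network, but the contract between the three steps is the same as in this sketch.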

Oozie is a workflow scheduler system for managing Apache Hadoop jobs. It is a server-based workflow engine, where a workflow is a collection of actions, such as Hadoop map/reduce, Pig, Hive, or Sqoop jobs, arranged in a control dependency DAG (Directed Acyclic Graph). Oozie is a scalable, reliable, and extensible system.
Amazon’s Elastic MapReduce (EMR) automates provisioning the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3.
Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of HDFS and the Map/Reduce framework, and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results to make the best use of the collected data.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
ZooKeeper is another Apache Software Foundation project. It provides an open source distributed coordination service, synchronization service, and naming registry for large distributed systems. ZooKeeper’s architecture supports high availability through redundant services. It uses a hierarchical file system, is fault tolerant and high performing, and facilitates loose coupling.
ZooKeeper is already used by many Apache projects, such as HDFS and HBase, and it is running in production at Yahoo, Facebook, Rackspace, etc.
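Going back to Oozie for a moment: the point of arranging actions in a control dependency DAG is that the engine can always work out which action is allowed to run next. Below is a small sketch of that idea in plain Python. It is not Oozie's workflow format or API; the action names and their dependencies are hypothetical, and the ordering uses the standard Kahn's algorithm for topological sorting.

```python
from collections import deque

# Hypothetical workflow: each action maps to the actions it depends on.
workflow = {
    "sqoop_import": [],
    "pig_clean": ["sqoop_import"],
    "hive_aggregate": ["pig_clean"],
    "mapreduce_index": ["pig_clean"],
    "sqoop_export": ["hive_aggregate", "mapreduce_index"],
}


def execution_order(dependencies):
    """Return one valid run order for a DAG of actions (Kahn's algorithm)."""
    # Number of unfinished dependencies per action.
    remaining = {action: len(deps) for action, deps in dependencies.items()}
    # Reverse edges: which actions are waiting on a given action.
    dependents = {action: [] for action in dependencies}
    for action, deps in dependencies.items():
        for dep in deps:
            dependents[dep].append(action)

    ready = deque(sorted(a for a, count in remaining.items() if count == 0))
    order = []
    while ready:
        action = ready.popleft()
        order.append(action)
        for waiting in sorted(dependents[action]):
            remaining[waiting] -= 1
            if remaining[waiting] == 0:
                ready.append(waiting)

    if len(order) != len(dependencies):
        raise ValueError("workflow contains a cycle, so it is not a DAG")
    return order


order = execution_order(workflow)
print(order)
```

In this made-up workflow, the Hive and MapReduce actions depend only on the Pig action, so a real scheduler would be free to run them in parallel; the sketch simply lists them one after the other. The cycle check is the reason the graph has to be acyclic: with a cycle, some action would wait forever.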

Data Analytics is the area where a lot of third-party vendors provide various proprietary as well as open source tools. A few of them are discussed below:
Pentaho offers data integration (Kettle), analytics, reporting, visualization, and predictive analytics directly from Hadoop nodes. It is available with enterprise support as well as in a community edition.
Storm is a free and open source distributed, fault-tolerant, real-time computation system for processing unbounded streams of data.
Splunk is an enterprise application that can perform real-time and historical search, as well as reporting and statistical analysis. It also comes in a cloud-based flavor, Splunk Storm.

When setting up the Hadoop ecosystem, you can either do the setup on your own or use a third-party distribution from vendors like Amazon, MapR, Cloudera, Hortonworks, etc. A third-party distribution might cost you a little extra, but it takes away the complexity of maintaining and supporting the system, so you can focus on the business problem.


  1. Nice summary. It would be great if you can go through a case study and demonstrate how this ecosystem works in real world applications.

    1. Thanks, Deven, for your kind words and feedback. Your idea is fantastic and I will certainly work towards that.

