
Hadoop Ecosystem

Some people still think of Hadoop as a single, out-of-the-box system that solves every big data problem. Unless you are thinking of a third-party commercial distribution, this is not correct. On its own, Hadoop is just HDFS and MapReduce. If you want a production-ready Hadoop system, you also have to consider Hadoop's friends (or components), which together make it a complete big data solution.

Most of the components are Apache projects, but a few are non-Apache open source or, in some cases, commercial. The ecosystem is continuously evolving with a large number of open source contributors. The following diagram gives a high-level overview of the Hadoop ecosystem.

Figure 1: Hadoop Ecosystem

The Hadoop ecosystem is logically divided into five layers which are self-explanatory. Some of the ecosystem components are explained below:

Data Storage is where the raw data resides. Hadoop supports multiple file systems, and connectors are available for data warehouses (DW) and relational databases.
HDFS is the distributed file system that comes out of the box with the Hadoop framework. It uses the TCP/IP layer for communication. An advantage of using HDFS is data awareness between the job tracker and task tracker: because the job tracker knows where the data lives, it can schedule tasks on or near the nodes that hold it.
Amazon S3 filesystem is targeted at clusters hosted on the Amazon Elastic Compute Cloud (EC2) server-on-demand infrastructure. There is no rack awareness in this file system, as it is all remote.
MapR’s maprfs is designed to provide high availability, transactionally correct snapshots and higher performance than HDFS. Maprfs is available as part of the MapR distribution.
HBase is a column-oriented, sparse, multidimensional sorted map database inspired by Google’s Bigtable. HBase provides sorted data access by maintaining partitions, or regions, of data. The underlying storage is HDFS.
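To make the idea of sorted access through regions more concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the HBase client API: the class `ToyRegionTable`, its methods, the split keys and the `info:kind` column name are all made up for illustration. It only shows how keeping row keys sorted and split into key ranges lets a scan touch just the regions that can contain the requested range.

```python
from bisect import bisect_right


class ToyRegionTable:
    """Toy model (not HBase itself): rows split into key-range regions."""

    def __init__(self, split_keys):
        # split_keys are the start keys of every region after the first one.
        self.split_keys = sorted(split_keys)
        self.regions = [dict() for _ in range(len(self.split_keys) + 1)]

    def region_index(self, row_key):
        # Region i holds keys in [split_keys[i-1], split_keys[i]).
        return bisect_right(self.split_keys, row_key)

    def put(self, row_key, column, value):
        region = self.regions[self.region_index(row_key)]
        region.setdefault(row_key, {})[column] = value

    def scan(self, start_key, stop_key):
        # Yield rows with start_key <= key < stop_key in sorted key order,
        # visiting only the regions that can overlap the range.
        first = self.region_index(start_key)
        last = self.region_index(stop_key)
        for region in self.regions[first:last + 1]:
            for key in sorted(region):
                if start_key <= key < stop_key:
                    yield key, region[key]


# Hypothetical data: three regions with boundaries at "g" and "n".
table = ToyRegionTable(split_keys=["g", "n"])
for name in ["pig", "hive", "avro", "oozie", "flume"]:
    table.put(name, "info:kind", "component")

rows = [key for key, _ in table.scan("a", "p")]
print(rows)
```

Because the regions are ordered by key range, the scan returns row keys in sorted order without sorting the whole table.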

Hive is a data warehouse infrastructure with SQL-like querying capabilities on Hadoop datasets. The SQL interface makes Hive an attractive choice for developers who want to validate data quickly, as well as for product managers and analysts.
Pig is a high-level data flow platform and execution framework for parallel computation. It uses the scripting language Pig Latin. Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter, so you can analyze the data in a Hadoop cluster even if you aren't familiar with Java and MapReduce.
Avro is a data serialization system that provides rich data structures, a container file format to store persistent data, and remote procedure calls (RPC). It uses JSON to define data types and protocols, and serializes data in a compact binary format.
Mahout is a machine learning library whose core algorithms cover (user- and item-based) recommendation or batch-based collaborative filtering, classification and clustering. The core algorithms are implemented on top of Apache Hadoop using the map/reduce paradigm, though it can also be used outside the Hadoop world as a math library focused on linear algebra and statistics.
Sqoop is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It is a command-line application that supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase.
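Several of the components above (Pig, Hive, Mahout) ultimately run as MapReduce jobs, so it helps to see the shape of that paradigm. The following is a minimal single-process sketch in plain Python, not Hadoop's Java API. The function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this illustration, and a real cluster would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all values by key, which the framework does
    # between the map and reduce steps.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce: combine all values seen for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Pig runs on Hadoop", "Hive runs on Hadoop"])
print(counts)
```

A tool like Pig lets you express the same grouping and counting as a short script and generates the equivalent jobs for you.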

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. It is a server-based workflow engine, where a workflow is a collection of actions, such as Hadoop map/reduce, Pig, Hive or Sqoop jobs, arranged in a control dependency DAG (Directed Acyclic Graph). Oozie is a scalable, reliable and extensible system.
Amazon’s Elastic MapReduce (EMR) automates provisioning the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3.
Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of HDFS and the Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
ZooKeeper is another Apache Software Foundation project, which provides an open source distributed coordination service, synchronization service and naming registry for large distributed systems. ZooKeeper’s architecture supports high availability through redundant services. It uses a hierarchical, file-system-like namespace, is fault tolerant and high performing, and facilitates loose coupling.
ZooKeeper is already used by many Apache projects such as HDFS and HBase, and runs in production at Yahoo, Facebook, Rackspace, etc.
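The control dependency DAG mentioned for Oozie above can be illustrated with a small sketch. This is plain Python, not Oozie's own workflow definition format, and the action names in the example are hypothetical. It assumes every dependency is itself listed as an action, and it simply computes an order in which the actions could run so that each one starts only after everything it depends on has finished.

```python
from collections import deque


def execution_order(workflow):
    """Return the actions in an order that respects every dependency.

    workflow maps each action name to the list of actions it depends on.
    Assumes every dependency also appears as a key in workflow.
    Raises ValueError if the dependencies contain a cycle (not a DAG).
    """
    pending = {action: set(deps) for action, deps in workflow.items()}
    # Actions with no dependencies can start immediately.
    ready = deque(sorted(a for a, deps in pending.items() if not deps))
    order = []
    while ready:
        action = ready.popleft()
        order.append(action)
        # Finishing this action may unblock the actions waiting on it.
        for other in sorted(pending):
            deps = pending[other]
            if action in deps:
                deps.remove(action)
                if not deps:
                    ready.append(other)
    if len(order) != len(pending):
        raise ValueError("workflow has a cycle, so it is not a DAG")
    return order


# Hypothetical workflow: import data, clean it, run two jobs, then notify.
workflow = {
    "sqoop_import": [],
    "pig_clean": ["sqoop_import"],
    "hive_report": ["pig_clean"],
    "mapreduce_stats": ["pig_clean"],
    "notify": ["hive_report", "mapreduce_stats"],
}
order = execution_order(workflow)
print(order)
```

In this example `hive_report` and `mapreduce_stats` depend only on `pig_clean`, so a scheduler would be free to run them in parallel. The sketch lists them one after the other only because it returns a single linear order.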

Data Analytics is the area where a lot of third-party vendors provide various proprietary as well as open source tools. A few of them are discussed below:
Pentaho – offers data integration (Kettle), analytics, reporting, visualization and predictive analytics directly from Hadoop nodes. It is available with enterprise support as well as in a community edition.
Storm – is a free and open source, distributed, fault-tolerant, real-time computation system for processing unbounded streams of data.
Splunk – is an enterprise application that can perform real-time and historical search, as well as reporting and statistical analysis. It also offers a cloud-based flavor, Splunk Storm.

When setting up the Hadoop ecosystem, you can either do the setup on your own or use a third-party distribution from vendors like Amazon, MapR, Cloudera, Hortonworks, etc. A third-party distribution might cost a little extra, but it takes away the complexity of maintaining and supporting the system, so you can focus on the business problem.


  1. Nice summary. It would be great if you can go through a case study and demonstrate how this ecosystem works in real world applications.

    1. Thanks, Deven, for your kind words and feedback. Your idea is fantastic and I will certainly work towards that.



