Some people still think of Hadoop as a single out-of-the-box system that solves every big data problem. Unless you are thinking of a third-party commercial distribution, this is not correct. In reality, Hadoop on its own is just HDFS and MapReduce. If you want a production-ready Hadoop system, you will also have to consider Hadoop's friends (or components), which make it a complete big data solution.
Most of the components are Apache projects, but a few are non-Apache open source or, in some cases, commercial. This ecosystem is continuously evolving with a large number of open source contributors. The following diagram gives a high-level overview of the Hadoop ecosystem.
Figure 1: Hadoop Ecosystem
The Hadoop ecosystem is logically divided into five layers, which are
self-explanatory. Some of the ecosystem components are explained below:
Data Storage is where the raw data resides. Hadoop supports multiple file
systems, and connectors are also available for data warehouses (DW) and
relational databases.
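HDFS is the distributed file system that comes out of the box with the Hadoop
framework. It uses the TCP/IP layer for communication. An advantage of using
HDFS is data awareness between the job tracker and task tracker: the job
tracker schedules tasks on task trackers with knowledge of where the data is
located.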
The Amazon S3 filesystem is targeted at clusters hosted on the Amazon Elastic
Compute Cloud (EC2) server-on-demand infrastructure. There is no rack awareness
in this file system, as it is all remote.
MapR’s maprfs provides high availability, transactionally correct snapshots
and, according to MapR, higher performance than HDFS. Maprfs is available as
part of the MapR distribution.
HBase is a column-oriented, sparse, multidimensional sorted data store inspired
by Google’s BigTable. HBase provides sorted data access by maintaining
partitions, or regions, of data. The underlying storage is HDFS.
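To make the idea of sorted regions a bit more concrete, here is a minimal Python sketch. It is not HBase code and does not use any HBase API; the region names and start keys are made up for illustration. It only shows how keeping region start keys sorted lets you find the region that owns a row key with a binary search, and how a range scan returns rows in key order.

```python
from bisect import bisect_right

# Hypothetical region boundaries: each region holds row keys in
# [its start key, the next region's start key). The first region starts
# at the empty key, so it covers everything below "g".
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_NAMES = ["region-1", "region-2", "region-3", "region-4"]


def find_region(row_key: str) -> str:
    """Return the name of the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position owns the key.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[index]


def scan(rows: dict, start: str, stop: str) -> list:
    """Return (key, value) pairs with start <= key < stop, in sorted key order."""
    return [(key, rows[key]) for key in sorted(rows) if start <= key < stop]
```

For example, `find_region("hadoop")` falls into the hypothetical `region-2`, because "hadoop" sorts between "g" and "n".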
Hive is a data warehouse infrastructure with SQL-like querying capabilities on
Hadoop datasets. The SQL interface makes Hive an attractive choice for
developers who want to quickly validate data, as well as for product managers
and analysts.
Pig is a high-level data flow platform and execution framework for parallel
computation. It uses the scripting language Pig Latin. Pig scripts are
automatically converted into MapReduce jobs by the Pig interpreter, so you can
analyze the data in a Hadoop cluster even if you aren't familiar with Java and
MapReduce.
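If you have never seen what a MapReduce job looks like, the following Python sketch may help. It is a single-process imitation of the map, shuffle and reduce steps for a word count, not the code Pig generates and not the Hadoop API; all function names are invented for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three steps in order, as a cluster would across many machines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

On a real cluster the map and reduce steps run in parallel on different nodes and the shuffle moves data over the network; a tool like Pig saves you from writing this plumbing by hand.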
Avro is a data serialization system which provides rich data structures, a
container file format to store persistent data, and remote procedure calls
(RPC). It uses JSON to define data types and protocols, and serializes data in
a compact binary format.
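The sketch below illustrates both halves of that description: a schema written in JSON, and a compact binary encoding of a record. The `User` schema is hypothetical, and the encoder is a hand-written toy that handles only the `string` and `int` types (zig-zag variable-length integers, and strings as a length followed by UTF-8 bytes). In practice you would use an Avro library rather than anything like this.

```python
import json

# Hypothetical record schema, written in Avro's JSON schema notation.
USER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
""")


def encode_long(n: int) -> bytes:
    """Encode an int/long as a zig-zag, variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small codes
    out = bytearray()
    while z & ~0x7F:          # more than 7 bits left: set the continuation bit
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its byte length followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Toy type table: only the two primitive types used by USER_SCHEMA.
ENCODERS = {"string": encode_string, "int": encode_long}


def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate the encoded fields in the order the schema declares them."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )
```

Note that the encoded bytes carry no field names or type tags, which is why the binary form stays compact and why a reader needs the schema to decode it.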
Mahout is a machine learning library whose core algorithms cover recommendation
(user- and item-based) or batch collaborative filtering, classification and
clustering. The core algorithms are implemented on top of Apache Hadoop using
the map/reduce paradigm, though it can also be used outside the Hadoop world as
a math library focused on linear algebra and statistics.
Sqoop is designed for efficiently transferring bulk data between Apache Hadoop
and structured datastores such as relational databases. It is a command line
interface application that supports incremental loads of a single table or a
free-form SQL query, as well as saved jobs which can be run multiple times to
import updates made to a database since the last import. Imports can also be
used to populate tables in Hive or HBase.
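The incremental load idea can be sketched in a few lines of Python. This is not Sqoop and does not call it; it uses an in-memory SQLite database as a stand-in for the source system, and the `orders` table, the function names and the "remember the last value of a check column" bookkeeping are assumptions made for illustration.

```python
import sqlite3


def incremental_import(conn, table, check_column, last_value):
    """Fetch rows whose check_column exceeds last_value.

    Returns (rows, new_last_value) so the caller can save the new value
    and pass it in on the next run.
    """
    # Table and column names are interpolated directly; that is fine for a
    # sketch, but they must come from trusted configuration, never user input.
    query = f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}"
    cursor = conn.execute(query, (last_value,))
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    if rows:
        last_value = rows[-1][columns.index(check_column)]
    return rows, last_value


def demo():
    """Import twice: the second run only sees the row added in between."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "disk"), (2, "cable")])
    first, last = incremental_import(conn, "orders", "id", 0)
    conn.execute("INSERT INTO orders VALUES (3, 'rack')")
    second, last = incremental_import(conn, "orders", "id", last)
    conn.close()
    return first, second, last
```

A saved job plays the role of `demo` here: it keeps the last imported value between runs so that each run only pulls what is new.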
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. It is a server-based workflow engine, where a workflow is a collection of actions (such as Hadoop map/reduce, Pig, Hive or Sqoop jobs) arranged in a control dependency DAG (Directed Acyclic Graph). Oozie is a scalable, reliable and extensible system.
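To see why the DAG matters, here is a small Python sketch that derives a valid run order from a set of action dependencies. It is not an Oozie workflow definition and uses no Oozie API; the action names are hypothetical, and the ordering comes from Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it waits for.
WORKFLOW = {
    "sqoop-import": set(),
    "pig-clean": {"sqoop-import"},
    "hive-aggregate": {"pig-clean"},
    "mapreduce-index": {"pig-clean"},
    "sqoop-export": {"hive-aggregate", "mapreduce-index"},
}


def execution_order(workflow):
    """Return one valid run order for the actions.

    Raises graphlib.CycleError (a ValueError) if the dependencies contain
    a cycle, that is, if the workflow is not a DAG.
    """
    return list(TopologicalSorter(workflow).static_order())
```

In this example `hive-aggregate` and `mapreduce-index` do not depend on each other, so a scheduler is free to run them in parallel once `pig-clean` has finished.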
Amazon’s Elastic MapReduce (EMR) automates provisioning of the Hadoop cluster,
running and terminating jobs, and handling data transfer between EC2 and S3.
Chukwa is an open source data collection system for monitoring large
distributed systems. Chukwa is built on top of HDFS and the MapReduce framework
and inherits Hadoop’s scalability and robustness. Chukwa also includes a
flexible and powerful toolkit for displaying, monitoring and analyzing results
to make the best use of the collected data.
Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. It has a simple
and flexible architecture based on streaming data flows. It is robust and fault
tolerant, with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple, extensible data model that allows for online
analytic applications.
ZooKeeper is another Apache Software Foundation project. It provides an open
source distributed coordination service, synchronization service and naming
registry for large distributed systems. ZooKeeper’s architecture supports high
availability through redundant services. It organizes its data in a
hierarchical, file-system-like namespace, is fault tolerant and high
performing, and facilitates loose coupling. ZooKeeper is already used by many
Apache projects such as HDFS and HBase, and it runs in production at Yahoo,
Facebook, Rackspace, etc.
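As a rough picture of what a hierarchical namespace used as a naming registry looks like, here is a toy in-memory model in Python. It is not a ZooKeeper client and offers none of its guarantees (no replication, no watches, no sessions); the class name, its methods and the `/services` paths are all invented for this illustration.

```python
class ZNodeTree:
    """Toy in-memory model of a hierarchical namespace (not a ZooKeeper client)."""

    def __init__(self):
        # Maps an absolute path such as "/services/web-1" to its data bytes.
        self._data = {"/": b""}

    def create(self, path: str, data: bytes = b"") -> None:
        """Add a node; its parent must already exist, as in a file system."""
        parent = path.rsplit("/", 1)[0] or "/"
        if path in self._data:
            raise KeyError(f"node exists: {path}")
        if parent not in self._data:
            raise KeyError(f"no parent node: {parent}")
        self._data[path] = data

    def get(self, path: str) -> bytes:
        """Return the data stored at a node."""
        return self._data[path]

    def children(self, path: str) -> list:
        """List the names of the direct children of a node, sorted."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self._data
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )
```

A service could register itself under a path like `/services/web-1`, and clients could discover all running instances by listing the children of `/services`.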
Data Analytics is the area where a lot of third-party vendors provide various
proprietary as well as open source tools. A few of them are discussed below:
Pentaho offers data integration (Kettle), analytics, reporting, visualization
and predictive analytics directly from Hadoop nodes. It is available with
enterprise support as well as in a community edition.
Storm is a free and open source distributed, fault-tolerant, real-time
computation system for processing unbounded streams of data.
Splunk is an enterprise application that can perform real-time and historical
search, as well as reporting and statistical analysis. It also comes in a
cloud-based flavor, Splunk Storm.
When setting up the Hadoop ecosystem, you can either do the setup on your own
or use a third-party distribution from vendors like Amazon, MapR, Cloudera,
Hortonworks, etc. A third-party distribution might cost you a little extra, but
it takes away the complexity of maintaining and supporting the system, so you
can focus on the business problem.