Mastering Hadoop: Book Review

I came across the book Mastering Hadoop, published by Packt and authored by Sandeep Karanth. Here is my detailed review of the book.

SUMMARY
This book is based on the most popular massively parallel processing (MPP) framework, "Hadoop", and its ecosystem. This is an intermediate-level book in which the author goes in depth not only on the principal subject but also on most of the supporting ecosystem projects like Hive, Pig, Storm, etc. The book has 374 pages and 12 chapters; the ToC itself spans 7 pages! It has conceptual material as well as hands-on lab exercises with a lot of code included.


OPINION
The book starts with the genealogy of Hadoop, where the author nicely narrates the evolution of web search to its current state and then the various releases of Hadoop. It gives good reasoning as to why Hadoop 2.0 was essential to move ahead from the previous version. It touches on the architecture, starting from the high-level three-layer view and drilling down step by step to the cluster and node level. It describes all the features of Hadoop 2.x nicely and then talks about the four major Hadoop distros.

The concepts of the MapReduce (MR) algorithm, like merges and spills of intermediate outputs, stragglers, job counters and data joins, are packed into chapter 2. On the labs front, it explains a MapReduce example in great detail and also explains a custom RecordReader implementation. Some tips are really handy, like the heuristic formula for calculating the optimum number of reducers. Note that the chapter assumes the reader has basic knowledge of this algorithm, and it talks about the advanced concepts.
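
The book's exact reducer formula is not reproduced in this review. As a rough sketch of what such a heuristic looks like, the Apache Hadoop MapReduce tutorial suggests 0.95 or 1.75 multiplied by (number of nodes × maximum containers per node). The function name and parameters below are my own, chosen only for illustration, and the book may use a different formula.

```python
def suggested_reducers(nodes: int, max_containers_per_node: int,
                       factor: float = 0.95) -> int:
    """Rule-of-thumb reducer count (hypothetical helper, not a Hadoop API).

    factor=0.95 lets all reducers start in a single wave;
    factor=1.75 gives faster nodes a second wave, for better load balancing.
    """
    if nodes < 1 or max_containers_per_node < 1:
        raise ValueError("cluster must have at least one node and one container")
    # Round to the nearest whole reducer and never return fewer than one.
    return max(1, round(factor * (nodes * max_containers_per_node)))


if __name__ == "__main__":
    # Assumed cluster: 10 worker nodes, 8 containers each.
    print(suggested_reducers(10, 8))          # single wave
    print(suggested_reducers(10, 8, 1.75))    # roughly two waves
```

Treat the result as a starting point for tuning rather than a final answer; data skew and reducer memory needs can push the right number either way.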

Chapter 3, Pig, talks about the in-depth execution process of a Pig Latin script and its semantics, along with many tips to optimize query performance. It also shows practical ways to use Pig for joining, as a combiner, and as an abstract data analyzer for a Directed Acyclic Graph (DAG).

Hive, in ch 4, is another way to scoop data from Hadoop in a conventional SQL-like style from the RDBMS world. This is also covered in pretty good detail, from its architecture to HiveQL semantics, execution steps and optimization tips like indexing, partitioning, etc. It also covers extensions like UDF, UDAF and UDTF.

The chapter on Hadoop serialization and I/O talks about SerDe techniques. After talking about Hadoop's own implementation and the JDK's implementation, it gradually introduces Apache's Avro tool, clearly stating its advantages with a detailed example. It explains the steps of Avro/Pig and Avro/Hive integration.

Chapters 6 and 7 talk about YARN and Storm. The YARN chapter introduces the new architecture along with an example of writing a client and scheduling a job, plus ways to monitor it. The Storm chapter talks about low-latency processing (aka real-time processing), compares Hadoop MR and Apache Storm with the help of process diagrams, and explains the concepts of spout, bolt and topology with the help of a Java-based example. It ends with installation on Hadoop.
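
To give a feel for the spout/bolt/topology idea, here is a toy sketch in plain Python. It is not the Storm API and not the book's Java example: the function names are made up, and generators stand in for the streams. A spout emits tuples, each bolt consumes a stream and emits a new one, and the topology is simply how they are wired together.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: the source of the stream (here, a fixed list of sentences)."""
    yield from sentences


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: turns each sentence into a stream of lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count and emits (word, count) per incoming word."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences: Iterable[str]) -> Dict[str, int]:
    """Topology: spout -> split bolt -> count bolt; returns the latest counts."""
    latest: Dict[str, int] = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest


if __name__ == "__main__":
    print(run_topology(["the cow jumped", "over the moon"]))
```

Unlike this single-process sketch, a real Storm topology runs spouts and bolts in parallel across a cluster and processes an unbounded stream, which is where the low-latency comparison with batch-oriented MR comes from.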

Then it moves on to Hadoop off-premise offerings (cloud based!) like Amazon's AWS-based EMR and Microsoft's Azure-based HDInsight, with enough comparison points as well as enough configuration steps.

The chapter on Hadoop replacements debates the pros and cons of HDFS and possible extensions like AWS S3 which can make it more powerful. Actually, adding more points on alternative systems like Cassandra, Ceph and GlusterFS would have been a value addition here.

Then it delves into features like HDFS Federation and Hadoop Security with its four pillars, Authentication, Authorization, Auditing and Data Protection, each explained in great detail.

And here comes ch 12, Analytics using Hadoop. Machine learning is a very interesting topic to have in this book, but I am not sure it needed that extent of detail. You need some statistical knowledge to understand some topics in the chapter, as it talks about terms and algorithms like tf-idf and k-means clustering. At the end, it talks about data analysis libraries: RHadoop and Mahout. Overall, this chapter provides a good handle on analytics.
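
For readers who have not met tf-idf before, here is a minimal sketch in plain Python. It assumes one common definition, tf = term count / document length and idf = ln(N / number of documents containing the term); several variants exist, and the book may use a different one. The function name is my own.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: score} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    docs = [["hadoop", "yarn"], ["hadoop", "storm"]]
    for doc_scores in tf_idf(docs):
        print(doc_scores)
```

A term that appears in every document (here "hadoop") scores zero, while a term unique to one document scores highest. That is the whole intuition behind the measure.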

The book ends with the appendix "Hadoop for MS Windows". Thanks to Hortonworks, you can now get a Hadoop distro on the Windows platform as well as their PaaS offering on MS Azure; more details follow in this appendix.

The author definitely seems to have rich experience in the field and is successful in conveying the depth of the subject through this book. Also, the source code for the book is available on GitHub.

In the otherwise crowded field of Hadoop beginners' books, this one is different in catering to an intermediate level. I wish this effort all the very best...


WHO SHOULD READ THIS BOOK
Anyone who has prior knowledge of Hadoop 1.x can easily upgrade to Hadoop 2.x YARN. But even someone with a little knowledge of databases and Java can read this book to explore this new ecosystem and enhance existing skills.

