
Big Data analysis on Cloud

It's very exciting to share a successful implementation of the MapReduce framework on Amazon's AWS infrastructure!

This was for a Fortune 500 beverage company whose input data comes from a couple of different market research companies, such as Nielsen. The first step was to get rid of Nielsen's proprietary client, Nitro, and gain more control over the monthly data analysis by storing the data in a MySQL database. In doing so alone, we had already brought the data analysis period down from 5 weeks to 2 weeks.

Now my team has implemented a MapReduce-style architecture that distributes the data processing across parallel worker nodes for ETL (running on Windows) and data analysis (on Linux), whose output is in turn aggregated on a master node that releases the data to a dashboard. It is represented pictorially below:
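The fan-out/fan-in pattern above can be sketched in a few lines of Python. This is a minimal illustration, not the actual pipeline: the function names are mine, and the toy (product, sales) records stand in for the far richer market-research data.

```python
def etl_worker(partition):
    """Map step: each worker node summarises its own slice of the raw data."""
    summary = {}
    for product, sales in partition:  # toy records; real ETL is far richer
        summary[product] = summary.get(product, 0) + sales
    return summary

def aggregate(partials):
    """Reduce step: the master node merges the per-worker summaries."""
    totals = {}
    for part in partials:
        for product, sales in part.items():
            totals[product] = totals.get(product, 0) + sales
    return totals

if __name__ == "__main__":
    # Two partitions standing in for two worker nodes.
    partitions = [
        [("cola", 10), ("soda", 5)],
        [("cola", 7), ("water", 3)],
    ]
    partials = [etl_worker(p) for p in partitions]
    print(aggregate(partials))  # {'cola': 17, 'soda': 5, 'water': 3}
```

In the real deployment each `etl_worker` call runs on its own EC2 instance and the master only ever sees the small per-node summaries, which is what makes adding worker nodes pay off.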

High Level Architecture
To publish the data, we used iCharts, a leading online charting service that supports various input data formats, giving users great flexibility.

The amazing fact is that the 12-day total process has come down to merely 3 days, and at the same infrastructure cost! This was possible only because of the cloud: the nodes are started on the fly using AWS scripts as soon as the monthly data becomes available and, after 3 days of processing, are parked back into the AWS account.
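The start-on-arrival, park-after-processing lifecycle could be scripted along these lines with boto3, the AWS SDK for Python. This is a hedged sketch, not the scripts we actually used (their exact form isn't shown in the post): the `fleet` helper and its placeholder IDs are hypothetical, and real calls need genuine instance IDs (e.g. `i-0abc...`) plus configured AWS credentials.

```python
def fleet(prefix, count):
    """Build a list of placeholder instance IDs for illustration only."""
    return [f"{prefix}-{i:02d}" for i in range(1, count + 1)]

def run_monthly_cycle(instance_ids):
    """Start the fleet when monthly data lands, park it again when done."""
    import boto3  # imported lazily so the sketch stays importable without AWS
    ec2 = boto3.client("ec2")
    ec2.start_instances(InstanceIds=instance_ids)  # bring the workers up
    # ... roughly 3 days of ETL and analysis happen here ...
    ec2.stop_instances(InstanceIds=instance_ids)   # park the nodes again

# Hypothetical fleets mirroring the Windows ETL and Linux analysis nodes:
print(fleet("etl-win", 3))  # ['etl-win-01', 'etl-win-02', 'etl-win-03']
```

Stopped instances incur no compute charges, which is why a 23-node fleet that runs only 3 days a month costs no more than the old always-on setup.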

Thanks to the cloud, and thanks a ton to Amazon!

Here are a few stats, just to get a feel of the size:
  • Total raw data size: 130 GB
  • Total data points crunched: 34 trillion
  • Data points after analysis: 240 billion
  • Total reports: 121 (on average, 6 table charts per report)
  • Total EC2 instances: 11 Windows, 12 Linux (for 3 days)
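A bit of back-of-the-envelope arithmetic on those stats shows the shape of the workload. The derived figures below are my own calculation from the numbers above, nothing more.

```python
# Figures from the stats above.
points_in = 34_000_000_000_000   # 34 trillion data points crunched
points_out = 240_000_000_000     # 240 billion data points after analysis
days = 3
instances = 11 + 12              # Windows ETL + Linux analysis EC2 nodes

# The analysis condenses the raw data points by two orders of magnitude,
# and each node chews through roughly half a trillion points per day.
reduction = points_in / points_out
per_instance_day = points_in / (instances * days)
print(f"reduction factor: {reduction:.0f}x")
print(f"throughput: {per_instance_day:.2e} points per instance-day")
```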

Though by today's standards this does not perfectly qualify as BIG DATA, I am sure this solution is easily scalable to handle petabytes of data, and I would be eager to grab the opportunity to prove it!

Last but not least, I would like to give special thanks to my colleagues Hemant and Prashant, whose dedicated efforts helped a lot during execution. Thank you, guys!

The cloud has tremendous power when it comes to elasticity and scalability, and I am sure more and more such case studies will keep popping up on different blogs...

