It's very exciting to share a successful implementation of a MapReduce framework on Amazon's AWS infrastructure!
This was for a Fortune 500 beverage company whose input data comes from a couple of different market research companies, such as Nielsen. The first step was to get rid of Nielsen's proprietary client, Nitro, and gain more control over the monthly data analysis by storing the data in a MySQL database. That step alone had already brought the data analysis period down from 5 weeks to 2 weeks.
Now my team has implemented a MapReduce architecture that distributes the data processing across parallel worker nodes for ETL (running on Windows) and data analysis (running on Linux). Their results are then aggregated on a master node, which releases the data to the dashboard. It is represented pictorially below:
|High Level Architecture|
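To make the flow concrete, here is a minimal sketch of the same idea in Python. It is not our production code: the `brand,units` record format, the function names `etl`, `analyze`, `worker` and `master`, and the use of threads in place of separate EC2 nodes are all assumptions made purely for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def etl(raw_rows):
    """ETL step (hypothetical format): parse 'brand,units' lines into tuples."""
    records = []
    for row in raw_rows:
        brand, units = row.split(",")
        records.append((brand.strip(), int(units)))
    return records


def analyze(records):
    """Analysis step: total the units per brand for one partition."""
    totals = Counter()
    for brand, units in records:
        totals[brand] += units
    return totals


def worker(partition):
    """One worker handles one partition end to end (the 'map' side)."""
    return analyze(etl(partition))


def master(partitions, max_workers=4):
    """Fan partitions out to workers, then merge partial results (the 'reduce' side)."""
    # Threads stand in for worker nodes here; this is an assumption of the sketch.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(worker, partitions))
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds counts together
    return dict(combined)


partitions = [["cola,3", "soda,2"], ["cola,5"]]
result = master(partitions)
```

Because each partition is processed independently, adding more workers shortens the wall-clock time without changing the merged result, which is the property the architecture relies on.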
To publish the data we use iCharts, a leading online charting system that supports a variety of input data formats and gives users great flexibility.
The amazing fact is that a process that took 12 days in total has come down to just 3 days, and at the same infrastructure cost! This is possible only because of the cloud: the nodes are started on the fly using AWS scripts as soon as the monthly data becomes available, and after 3 days of processing they are stopped and parked back in the AWS account.
Thanks to cloud and thanks a ton to Amazon!!
Here are a few stats just to get a feel for the size:
- Total raw data size : 130 GB
- Total data points crunched : 34 trillion
- Data points after analysis : 240 billion
- Total reports : 121 (on average, 6 table charts per report)
- Total EC2 instances : 11 Windows, 12 Linux (for 3 days)
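A quick back-of-the-envelope calculation puts these numbers in perspective. This is only a sketch: it assumes every instance runs around the clock for the full 3 days and that each report has exactly 6 charts, neither of which is exact.

```python
# Figures taken from the stats above.
raw_points = 34 * 10**12        # data points crunched
analyzed_points = 240 * 10**9   # data points after analysis
windows_nodes, linux_nodes = 11, 12
run_days = 3
reports, charts_per_report = 121, 6

# How much the analysis condenses the data.
reduction_factor = raw_points / analyzed_points

# Assumption: all nodes run 24 hours a day for the whole run.
total_nodes = windows_nodes + linux_nodes
instance_hours = total_nodes * run_days * 24

# Speed-up of the monthly run: 12 days down to 3 days.
speedup = 12 / 3

# Approximate number of table charts published each month.
total_charts = reports * charts_per_report

print(f"reduction: ~{reduction_factor:.0f}x, nodes: {total_nodes}, "
      f"instance-hours: {instance_hours}, speedup: {speedup:.0f}x, "
      f"charts: ~{total_charts}")
```

So roughly 23 nodes and about 1,650 instance-hours per month condense the raw data by a factor of around 140.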
Though by today's standards this does not quite qualify as BIG DATA, I am sure this solution can easily scale to handle petabytes of data, and I would be eager to grab the opportunity to prove it! :)
Last but not least, I would like to give special thanks to my colleagues Hemant and Prashant, whose dedicated efforts helped a lot during execution. Thank you guys!
The cloud has tremendous power when it comes to elasticity and scalability, and I am sure more and more such case studies will keep popping up on different blogs ...