Thursday, October 18, 2012

Getting Control of BIG DATA


I got a copy of HBR’s October edition and liked its featured article “Getting Control of BIG DATA”. It is a fantastic article giving deep insight into the subject! ...and the credit goes to the authors: Andrew, a research scientist, and Erik, an MIT professor.

Big data definitely has huge business potential, but at the same time it poses a few business challenges, especially from the perspective of corporate culture.

Here are a few bullets of general interest:

  • Why Big Data? …analyzing it leads to better predictions, and better predictions yield better decisions. In sector after sector, companies that figure out how to combine domain expertise with data science will pull away from their rivals.
  • Web fact: 2.5 exabytes of data are created daily, and this number doubles every 40 months.
  • Retail fact: Walmart collects 2.5 petabytes of data every hour from its customer transactions.
  • Case Study (Retail) - Sears: Sears is using a Hadoop cluster for data analytics. (The time to generate personalized promotions came down from 8 weeks to 1 week, saving cost and time!) It is interesting to know that they store the data directly on Hadoop clusters and do real-time analysis.
  • Case Study (Aviation) - PASSUR: Improved airline ETAs, which helped improve staff efficiency; the benefit would be worth several million dollars a year at each airport.
  • Big data can shift the corporate culture by muting the HiPPOs (Highest-Paid Person's Opinion): rather than relying on “intuitive” decisions from people high up in the organization, decisions would be based on real data analytics.
  • It would redefine the domain expert: as the big data movement advances, the value of domain experts will shift from their HiPPO-style answers to their data-specific questions.
  • New lucrative role of Data Scientist: a person who can extract treasure out of messy, unstructured data.
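
To get a feel for what "doubling every 40 months" means, here is a minimal Python sketch. It assumes the 2.5 exabytes/day figure as the starting point and a perfectly steady doubling rate, which is a simplification; the function name is just illustrative.

```python
# Illustrative projection of the "doubling every 40 months" figure.
DAILY_EXABYTES_2012 = 2.5      # starting point, taken from the bullet above
DOUBLING_PERIOD_MONTHS = 40    # doubling period, taken from the bullet above


def projected_daily_exabytes(months_from_now: float) -> float:
    """Daily data volume after the given number of months, assuming steady doubling."""
    return DAILY_EXABYTES_2012 * 2 ** (months_from_now / DOUBLING_PERIOD_MONTHS)


if __name__ == "__main__":
    for months in (0, 40, 80, 120):
        volume = projected_daily_exabytes(months)
        print(f"after {months:3d} months: {volume:.1f} exabytes/day")
```

Under that assumption the daily volume grows eightfold in ten years.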

The article contains an in-depth discussion of all this, and it is worth grabbing a copy from the newsstand and digesting it if you are serious about business in upcoming technologies.

Sunday, August 19, 2012

Amazing Pentaho!

I am using PRD (Pentaho Report Designer) to build a few PDF reports. Until now I generated hundreds of reports manually from the Pentaho BI server, which was a big time-consuming task. I came across a new transformation output step in Kettle, "Pentaho Reporting Output", which helps automate the report generation. A sample Kettle transformation is shown in the diagram below:

Kettle Transformation using Report Output
Initially I was trying to integrate a PRD 3.8.x prpt file with PDI (aka Kettle) 4.3 and could not make it work. Later, with help from the Pentaho forum, I found that the tools in the family are typically not compatible across all versions; backward compatibility in particular can sometimes be an issue. The simple reason for this is probably cost, as this is the community edition. When I created the report with the latest PRD 3.9, I could integrate it with Kettle without any hassle.

Jaspersoft and Pentaho are two leading open source BI solutions available in the market. But I found the Pentaho family more mature, as it has fantastic integration support for the leading solutions in this area like Palo, Hadoop, MongoDB and Cassandra, just to name a few. Its tools have also grown organically, like its ETL tool Pentaho Data Integration, aka Kettle. I think that even though Pentaho does not have very good Eclipse plugin support, overall as a BI solution it has tremendous potential to easily become an enterprise-level solution.

Monday, July 16, 2012

Hacknight by hasgeek, at Pune

This was my first ever hacknight experience! It was indeed a memorable night, and a note of thanks to the @hasgeek team for an amazing, well-organized event with good food, beverages and shelter! ...the little goof-up with the venue address and directions is forgiven!! I must thank Mahesh as well for pushing a family guy to a geeks' hacknight. :)

The event ran from 02:00 PM on 14 July 2012 to 10:00 AM on 15 July 2012, at AmiWorks, Pune. The place was not very difficult to locate, thanks to Zainab. It was a penthouse with all the basic facilities in place.

I reached there around 13:45 hours and, following true Indian Standard Time, I made it 15 minutes early. A few hackers had already arrived and some had already started hacking on their ideas! I got myself registered and paid the registration fees. And then we started socializing. During the discussion, I was planning to join the group of Navin or Piyush, but later found that both of them had come just to kick-start the event. But then, during the night, I got another two... Samyak and Kaustubh.

Nikhil had a great idea for a travel locator based on Neo4j, and Jaideep, a web hacker, had an idea for visualization using D3. I teamed up with Krishna, a PHP geek, and tried to build an analytical tool for Facebook pages. For the first time I dived deep into the Facebook API and understood it. Overall it was great learning, getting familiar with Neo4j, D3, Facebook, Hadoop, etc. More importantly, I learned what a geek is!

Though the final analytical tool still remains a dream, it was a cool experience.

As Krishna wanted to catch the Sinhgad Express at 6:00 am and I too was about to crash by then, I left the group at 5:40 am without the cool hacknight T-shirt... [Thanks Zainab for handing over the tee to my peer at the B'lore event!]

Jaideep captured these moments with his high-tech SLR. Here are a few clippings/clicks -



Monday, June 18, 2012

Use MySQL Information Schema carefully!

While working on a procedure that ran on a dynamically created data table to derive one complex number, I observed a strange problem: MySQL was continuously eating memory and, in spite of 35 GB of RAM, was running out of memory after a few hours.


After debugging the algorithm, I found that we were querying information_schema.columns to get the column name for each calculation, and after about 10,000 queries MySQL would start crawling and we had to restart it to release the memory. The issue was fixed by adding one static table containing the column names, populated each time the data table is created dynamically.


But it is interesting to observe this behavior of a system table, which uses a cache but does not release it quickly! In fact, one can optimize the usage of the information schema by referring to the official MySQL manual.
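
The workaround can be sketched roughly as below. This is only an illustration of the idea, not the original procedure: it uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere, and the table and column names (column_lookup, monthly_data, sales, ...) are made up for the example.

```python
import sqlite3


def create_data_table(conn, table, columns):
    """Create a data table and record its column names in a static lookup table."""
    cols_sql = ", ".join(f'"{c}" REAL' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({cols_sql})')
    conn.execute(
        "CREATE TABLE IF NOT EXISTS column_lookup "
        "(table_name TEXT, position INTEGER, column_name TEXT)"
    )
    # Populate the lookup table once, at the moment the data table is created.
    conn.executemany(
        "INSERT INTO column_lookup VALUES (?, ?, ?)",
        [(table, i, c) for i, c in enumerate(columns)],
    )


def column_names(conn, table):
    """Read column names from the lookup table instead of the metadata catalog."""
    rows = conn.execute(
        "SELECT column_name FROM column_lookup "
        "WHERE table_name = ? ORDER BY position",
        (table,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_data_table(conn, "monthly_data", ["sales", "volume", "price"])
    print(column_names(conn, "monthly_data"))
```

Every calculation then reads from an ordinary table, so the thousands of repeated metadata queries never reach the information schema.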

Sunday, June 10, 2012

Big Data analysis on Cloud


It's very exciting to share a successful implementation of a MapReduce framework on Amazon's AWS infrastructure!

This was for a Fortune 500 beverage company whose input data comes from a couple of different market research companies like Nielsen. The first step was to get rid of Nielsen's proprietary client, Nitro, and gain more control over the monthly data analysis by storing the data in a MySQL database. In doing so, we had already brought the data analysis period down from 5 weeks to 2 weeks.

Now my team has implemented a MapReduce-style architecture that distributes the data processing across parallel worker nodes for ETL (running on Windows) and data analysis (on Linux); their results are then aggregated on a master node, which releases the data to the dashboard. It is represented pictorially below:

High Level Architecture
To publish the data, we use iCharts, a leading online charting system that supports various input data formats, giving great flexibility to the users.
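
The worker/master split described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: threads stand in for the EC2 worker nodes, and the record format (brand, units) and the function names are hypothetical, not taken from the real system.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def worker(chunk):
    """Map step: one worker node summarises its own chunk of records."""
    totals = Counter()
    for brand, units in chunk:
        totals[brand] += units
    return totals


def master(partials):
    """Reduce step: the master node merges the partial results."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds the counts together
    return dict(combined)


def run(chunks, max_workers=4):
    """Fan the chunks out to parallel workers, then aggregate on the master."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return master(pool.map(worker, chunks))


if __name__ == "__main__":
    chunks = [
        [("cola", 2), ("juice", 1)],
        [("cola", 3)],
    ]
    print(run(chunks))
```

Since each chunk is processed independently, adding more workers shortens the wall-clock time without changing the aggregated result.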

The amazing fact is that the total process has come down from 12 days to merely 3 days, and at the same infrastructure cost! This is possible only because of the cloud: the nodes are started on the fly using AWS scripts as soon as the monthly data becomes available and, after 3 days of processing, are parked back in the AWS account.

Thanks to cloud and thanks a ton to Amazon!!

Here are a few stats just to get a feel for the size:
  • Total raw data size : 130 GB
  • Total data points crunched : 34 trillion
  • Data points after analysis : 240 billion
  • Total reports : 121 (on an avg. 6 table charts per report)
  • Total ec2 instances : 11 windows, 12 linux (for 3 days)
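
A quick back-of-the-envelope calculation on these numbers, as a small Python sketch. It uses only the figures listed above; "6 table charts per report" is an average, so the chart total is approximate.

```python
# Figures taken from the stats list above.
DATA_POINTS_CRUNCHED = 34 * 10**12   # 34 trillion
DATA_POINTS_AFTER = 240 * 10**9      # 240 billion
REPORTS = 121
AVG_CHARTS_PER_REPORT = 6            # an average, so the total is approximate
EC2_INSTANCES = 11 + 12              # Windows + Linux
PROCESSING_DAYS = 3

reduction_factor = DATA_POINTS_CRUNCHED / DATA_POINTS_AFTER
approx_table_charts = REPORTS * AVG_CHARTS_PER_REPORT
instance_days = EC2_INSTANCES * PROCESSING_DAYS

if __name__ == "__main__":
    print(f"data reduced by a factor of about {reduction_factor:.0f}")
    print(f"about {approx_table_charts} table charts in total")
    print(f"{instance_days} instance-days of EC2 time per monthly run")
```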


Though by today's standards this does not quite qualify as BIG DATA, I am sure this solution is easily scalable to handle petabytes of data, and I would be eager to grab the opportunity to prove it! :)

Last but not least, I would like to give special thanks to my colleagues Hemant and Prashant, whose dedicated efforts helped a lot during execution. Thank you guys!

The cloud has tremendous power when it comes to elasticity and scalability, and I am sure more and more such case studies will keep popping up on different blogs ...

Thursday, May 6, 2010

Applying for Indian Passport, some tips (non-technology)

Recently I happened to visit the passport office in Pune to renew my passport, and to my surprise the application got rejected. I say "to my surprise" because while filling in the application form, my wife had done enough homework and checked a couple of times that there was no mistake in it. But the officer who was scrutinizing the form rejected it within a few minutes, saying that I should not have written my parents' names in full. Rather, they should be just like "last_name first_name", as in my current passport. But if this is the case, then why does the application form say "write FULL name"?

Based on my whole experience, I am putting down a few tips which would be useful, especially for Pune-based people:
  1. Make sure you take an online appointment from http://passport.gov.in
  2. While filling in the application form for a renewal, make sure you carry over EXACTLY the same information as is on your "to be expired" passport. This is very important because any change will either need to be backed by appropriate proof or you will have to be ready to revisit the office!
  3. Typically the appointment time is 10:00am on any working day (they work 5 days a week). So make sure you reach the office (located on Senapati Bapat Road) by 9:30am so that you can be near the front of the long queue.
  4. Don't carry a laptop or similar stuff in your bag, because you will be asked to keep your bags outside, literally on the open ground.
  5. See if you can have one person accompany you, as it will be useful for managing additional photocopies or other outside errands, if required, while you are standing in the queue.
  6. If you are adding your spouse's name, you must have proof of marriage.
  7. All the documents should be in sets of two and self-attested.
  8. In case of a dispute, don't argue with the passport officer, as it will be pointless. Rather, be patient and keep your actions goal-oriented!
  9. Come prepared with one full day of leave, as half a day may not be sufficient.
  10. If you don't have time, there is an alternative. You can catch an agent there; he will charge you just INR 900, and then even your presence is not required to submit the application.
  11. After submitting the application, it takes around 15 days for police verification, and then after another 15 days, if aal izz well, you will receive the passport. Voilà!
I hope this info is useful and helps someone save their time.

Thursday, February 18, 2010

Cloud Computing: Next BIG revolution in IT (Can Avatar become reality?)

After Web 2.0, cloud computing is the latest buzzword in the industry. I remember IBM started implementing the first grid computing architecture just a few years back, and now it's turning into a widely accepted reality. We can now say, by and large, that the cloud has evolved from grid computing into a fairly mature model. Amazon has become a pioneer by launching its AWS suite. Still, this idea has a long way to go...

Actually, it is going to be a topic of debate whether it is "grid computing" in the cloud or the cloud in "grid computing". ...each cloud can be made up of grids, and all clouds can in turn be connected in an outer grid! I think it really depends on how one visualizes things. But clearly the bottom line is to leverage the power of computing so that, instead of focusing mostly on building faster chips, the industry can shift its focus to building efficient clouds and faster connectivity. I am sure roughly 50% of the machines in the world (probably more, considering the huge redundancy in clustered environments) are sitting idle, counting industry servers as well as personal desktops. Going a bit deeper, the average load factor of a machine would be even lower than 50%. (I don't know if someone has spent time on such a survey, but I will update the numbers if I find one on the net.) Wouldn't it be a big success if all machines attained a load factor of nearly 100%? The biotech (esp. genetics) world, which still needs more and more CPU power for analysis, would be the happiest.

So considering these facts, it seems obvious to me that cloud computing is going to be the next BIG thing, and it will impact not only IT but every human and their lifestyle in general. Who knows, in future, with the combination of RFID and cloud, the whole earth may become one giant network, just like the hair/ponytail concept that linked the Na'vi with the creatures of Pandora in Cameron's Avatar!