

Showing posts from 2012

Getting Control of BIG DATA

I got a copy of HBR's October edition and liked its featured article, “Getting Control of BIG DATA”. It is a fantastic article giving deep insight into the subject, and the credit goes to its authors: Andrew, a research scientist, and Erik, an MIT professor. Big data definitely has huge business potential, but at the same time it poses a few business challenges, especially from the perspective of corporate culture. Here are a few bullets of general interest: Why Big Data? …analyzing it leads to better predictions, and better predictions yield better decisions. In sector after sector, companies that figure out how to combine domain expertise with data science will pull away from their rivals. Web fact: 2.5 exabytes of data are created daily, and this number doubles every 40 months. Retail fact: Walmart collects 2.5 petabytes of data every hour from its customer transactions. Case study (Retail) - Sears: Sears is using a Hadoop cluster for data analytics. (personalized promot

Amazing Pentaho!

I am using PRD (Pentaho Report Designer) to build a few PDF reports. Until now I generated hundreds of reports manually from the Pentaho BI server, which was a very time-consuming task. I came across a new transformation output step in Kettle, "Pentaho Reporting Output", which helps automate report generation. Find the sample Kettle transformation in the diagram below - Kettle Transformation using Report Output. Initially I tried to integrate a PRD 3.8.x prpt file with PDI (aka Kettle) 4.3 and could not make it work. Later, with help from the Pentaho forum, I found that the products in the family are typically not compatible across all versions; backward compatibility in particular can sometimes be an issue. The simple reason for this is probably cost, as it is the community edition. When I created the report with the latest PRD 3.9, I could integrate it with Kettle without any hassle. Jaspersoft and Pentaho are two leading open source BI solutions available in the market. But I found Pen

Hacknight by hasgeek, at Pune

This was my first ever experience of a hacknight! It was indeed a memorable night, and a note of thanks to the @hasgeek team for an amazing, well-organized event with good food, beverages and shelter! ...the little goof-up with the venue directions is forgiven!! I must thank Mahesh as well for pushing a family guy to a geeks' hacknight. :) The event ran from 14 July 2012, 02:00 PM to 15 July 2012, 10:00 AM, at AmiWorks, Pune. The place is not very difficult to locate, thanks to Zainab. It was a penthouse with all the basic facilities in place. I reached there around 13:45 hours, and following true Indian Standard Time, I made it 15 minutes early. A few hackers had already arrived and some had already started hacking on their ideas! I got myself registered and paid the registration fees. And then we started socializing. During the discussion, I was planning to join the group of Navin or Piyush, but later found that both of them had come just to kick-start the event. But

Use MySQL Information Schema carefully!

While working on a procedure that ran against a dynamically created data table to derive one complex number, I observed a strange problem: MySQL kept eating memory and, in spite of 35GB of RAM, ran out of memory after a few hours. After debugging the algorithm, we found that we were querying information_schema.columns to get the column name for each calculation, and after 10,000 queries MySQL would start crawling and we had to restart it to release the memory. We fixed the issue by adding a static table containing the column names, which is populated each time the data table is created dynamically. It is still interesting to observe this behavior of a system table that uses a cache but does not release it quickly! In fact, one can optimize the usage of the information schema by referring to the MySQL official manual.
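The static-table workaround can be sketched roughly as follows. This is only an illustration, not the original procedure: it uses Python with an in-memory SQLite database as a stand-in for MySQL so that it runs anywhere, and the names column_catalog, create_data_table, column_names and monthly_data are all hypothetical. The point is that column names are written once, when the data table is created, and every later lookup reads that small table instead of the system catalog.

```python
import sqlite3


def create_data_table(conn, table, columns):
    """Create a data table and record its column names in a static lookup table.

    Hypothetical helper: identifiers are interpolated into the SQL text,
    so they must come from trusted code, never from user input.
    """
    col_defs = ", ".join(f'"{c}" REAL' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')

    # The static catalog (an assumed name) replaces per-calculation
    # queries against the system catalog.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS column_catalog ("
        "table_name TEXT, ordinal INTEGER, column_name TEXT, "
        "PRIMARY KEY (table_name, ordinal))"
    )
    # Refresh the catalog entries for this table on every (re)creation.
    conn.execute("DELETE FROM column_catalog WHERE table_name = ?", (table,))
    conn.executemany(
        "INSERT INTO column_catalog VALUES (?, ?, ?)",
        [(table, i, c) for i, c in enumerate(columns, start=1)],
    )
    conn.commit()


def column_names(conn, table):
    """Look up column names from the static catalog, in creation order."""
    rows = conn.execute(
        "SELECT column_name FROM column_catalog "
        "WHERE table_name = ? ORDER BY ordinal",
        (table,),
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
create_data_table(conn, "monthly_data", ["sales", "volume", "price"])
names = column_names(conn, "monthly_data")
```

In MySQL the same idea would be an ordinary table filled by the code that creates the data table, so the hot loop never touches information_schema.columns.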

Big Data analysis on Cloud

It's very exciting to share a successful implementation of a MapReduce framework on Amazon's AWS infrastructure! This was for a Fortune 500 beverage company whose input data comes from a couple of different market research companies, such as Nielsen. The first step was to get rid of Nielsen's proprietary client, Nitro, and gain more control over the monthly data analysis by storing the data in a MySQL database. In doing so, we had already brought the data analysis period down from 5 weeks to 2 weeks. Now my team has implemented a MapReduce architecture that distributes data processing across parallel worker nodes for ETL (running on Windows) and data analysis (on Linux); their results are then aggregated on a master node, which releases the data to the dashboard. It is represented pictorially below: High Level Architecture. To publish the data, we use iCharts, a leading online charting system that supports various input data formats, giving users great flexibility.
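The worker/master split described above can be illustrated with a small single-machine sketch. It is not the team's implementation: it assumes Python, uses threads in place of separate worker nodes, and the function names, brand names and figures are invented purely for illustration. Each worker totals one partition of the input (the map step), and the master merges the partial totals (the reduce step).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def worker(partition):
    """Map step: total the sales per brand within one partition."""
    totals = Counter()
    for brand, sales in partition:
        totals[brand] += sales
    return totals


def master(partitions, max_workers=4):
    """Reduce step: run the workers in parallel and merge their partial totals."""
    combined = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(worker, partitions):
            combined.update(partial)  # Counter.update adds counts together
    return dict(combined)


# Invented sample data: each inner list stands for one worker's share.
partitions = [
    [("cola", 10), ("juice", 5)],
    [("cola", 7), ("water", 3)],
    [("juice", 2), ("water", 4)],
]
totals = master(partitions)
```

Because each worker only ever sees its own partition, adding more workers shortens the map step without changing how the master aggregates the results.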