I recently got certified with Cloudera (CCAH) and found myself the only one in the class who actually uses Hadoop. I often run into people trying to sell Big Data appliances to brand owners “to store and monitor their social network XXX”, where XXX may be replaced by any word that tickles or frightens brand owners.

My personal opinion: whether you are a brand owner or not doesn’t matter; data matters. And only data generated and maintained within your own company matters. If you want to throw away lots and lots of money, then go ahead and start storing data that is freely accessible to anyone.

The new trend seems to be that all data from every possible sensor inside a company shall be collected forever, to be crunched later by a data scientist or a business intelligence person (whatever that is in particular). After that mastermind has crunched all that data, he will bring you a magical solution to all your (company’s) problems.

Some free advice: they will tell you to fire people, cut useless costs (themselves not included) and in general run everything a bit leaner (data storage not included).

Let’s get back to the point: what is big data? 1TB? 1TB a day? 1TB an hour? Or is it as simple as …

Twitter user @DEVOPS_BORAT claims that Big Data is anything that crashes Excel.

To be honest with you, I don’t know either. It’s very dependent on how you’re using the data, and not as much about size as you would imagine. If you’re talking about webserver logfiles which have to be kept for at least a year for legal reasons and will in most cases not be touched until they are deleted, you probably just want to buy some tapes. If your logs make all your money, you make even more money by feeding their aggregates back into your application, you really want to be able to scale that, and the output overwhelms any single box trying to aggregate that data, then welcome to your new big data appliance.

In general, at least with Hadoop and the Hadoop ecosystem, we’re talking about a method for doing large aggregations on very large datasets that would be unthinkable on a single machine, and that scales by just adding more metal. It is much less expensive to just buy storage, so if you’re only storing data, buy suitable storage and use “grep” when you want to gather information from that data.
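
To make “large aggregations” a bit less abstract, here is a minimal sketch of the map/reduce idea in plain Python. This is not Hadoop code and uses no Hadoop API; the log format (“METHOD PATH STATUS”), the sample lines and the function names are all assumptions made up for illustration. Each chunk stands in for the slice of data one machine would process, and the reduce step merges the partial results.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: count HTTP status codes within one chunk of log lines.

    Assumes a made-up log format "METHOD PATH STATUS"; lines with
    fewer than three fields are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

def aggregate(chunks):
    """Run the map step on every chunk, then fold the partial results together."""
    return reduce(reduce_counts, map(map_chunk, chunks), Counter())

# Hypothetical sample data: two chunks, as if stored on two different boxes.
SAMPLE_CHUNKS = [
    ["GET /index.html 200", "GET /missing 404"],
    ["POST /login 200", "GET /index.html 200"],
]

if __name__ == "__main__":
    print(aggregate(SAMPLE_CHUNKS))
```

On one box this is just a loop with extra steps. The point is that the map step only ever looks at its own chunk, so the chunks can live on different machines, which is roughly why this style of aggregation scales by adding more metal.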