State of Data #75

#TimeFlies – This is the 75th episode of the State of Data Newsletter. Thanks for your support and feedback. No SoD will be published next week or during the two weeks of Christmas, as was the case last year. Happy Thanksgiving in advance!


#analysis – ‘Understanding Uncertainty: How to visualize probability’ – excellent article from +2

#architecture – Storage Infrastructure behind Facebook Messages

#big_data – “3 hrs in the life of a (glorified) Data Scientist” – we really spend more time transporting and re-transporting data because of formatting/quality issues than we spend eating over our lifetime!

#contest – Predict photo quality, help Pete Warden, win some money, better your Machine Learning skills. #PerfectWorld

#Data_Science – Quick Start in Tweet analysis using R

#DBMS – How to easily check for phone number using MySQL

#idea – How many iPads would it take to match the world’s fastest supercomputer? 61 million if you are using the iPad 2.


#learning – All possible SQL Injection use cases to try out (without any damage)


#visualization – Sweat Map – A chart for Marathoners



State of Data #74

#analysis – What’s wrong with BBC’s ‘Bowel Cancer Map’?

‘Adding up the populations for each area, calculated from the deaths and death rates, I get a UK population of 89.75 million. Since the true figure for 2008 was about 61 million, that’s rather surprising.’
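
The check described in that quote can be sketched roughly as follows: if each area reports a death count and a crude death rate, the implied population is deaths divided by the rate, and the per-area results are summed. The area names, the numbers, and the per-100,000 convention below are all assumptions made for illustration; none of them come from the BBC map.

```python
# Hypothetical per-area figures: (deaths, death rate per 100,000 people).
# Illustrative values only, not the BBC map's data.
areas = {
    "Area A": (120, 20.0),
    "Area B": (45, 15.0),
    "Area C": (200, 25.0),
}

RATE_BASE = 100_000  # assumption: rates are crude rates quoted per 100,000


def implied_population(deaths, rate_per_base, base=RATE_BASE):
    """Back out the population for which deaths / population equals the rate."""
    return deaths / rate_per_base * base


total_implied = sum(implied_population(d, r) for d, r in areas.values())
print(f"Implied total population: {total_implied:,.0f}")
```

If the implied total differs sharply from a known census figure, as 89.75 million does from about 61 million, the published deaths or rates deserve a second look.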

#architecture – “MongoGate – or let’s have a serious NoSQL discussion”, based on the original rant/expose (depending on your POV).

“Instead of using the relation model, what the NoSQL movement brings to the table is you can now choose from other ways to reach your own hell. You are free to pick KV-stores, Document DBs or other more complex ways of expressing yourself (beneath some SQL-stuff of course). But does this really transcend the current state of the art? Is this really different from SQL-based systems?!”

#big_data – ‘Big Data’ is essentially about solving performance problems

#Data_Science – Closer look at Oracle Big Data Appliance

#DBMS – Could you use only SQL to ‘find a secret message hidden in a seemingly random collection of words’ (PDF)? How an Australian, a Dutch and a Russian engineer independently solved it.

#idea – Is ‘Big Data’ plain evil or just another bubble?

“This is a common characteristic of technology that its champions do not like to talk about, but it is why we have so many bubbles in this industry. Technologists build or discover something great, like railroads or radio or the Internet. The change is so important, often world-changing, that it is hard to value, so people overshoot toward the infinite. When it turns out to be merely huge, there is a crash – in railroad bonds, or RCA stock, or Perhaps Big Data is next, on its way to changing the world.”

‘All your Bayes are belong to us’ – A collection of fun Bayes’ Theorem Problems

#visualization – Method mined Google Search data to figure out what people REALLY want in a product (e.g., in a tablet)


  • #math In a race between a butterfly and a bat, the latter may finish faster. But how do you compare which one moved the fastest? The Strouhal Number will help.
  • “NBC’s “30 Rock” rates very highly with European car buyers. Lincoln and Mercury buyers are more likely than other car buyers to watch the Gospel Music Channel.”
    How TV Media planning has entered ‘The Age of Databases’
  • Most intelligent chatbot two years in a row (Ed: Interested? A great narrative of Loebner Prize, AI and Humanness is in this recent highly readable book)
  • #fromTwitter What is the smallest integer that, when written in words, is not identifiable in a tweet (140 chars)? This is a joke/tweet take on the Berry Paradox.
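
For the #math item in the list above: the Strouhal number for flapping flight is commonly written as St = f × A / U, where f is the wingbeat frequency, A the stroke amplitude, and U the forward speed. The sketch below uses that definition; the function name and every numeric value are made up for illustration and are not measurements of real animals.

```python
def strouhal(frequency_hz, amplitude_m, speed_m_s):
    """Strouhal number St = f * A / U (dimensionless)."""
    if speed_m_s <= 0:
        raise ValueError("forward speed must be positive")
    return frequency_hz * amplitude_m / speed_m_s


# Hypothetical values, chosen only to illustrate the calculation.
butterfly = strouhal(frequency_hz=10.0, amplitude_m=0.05, speed_m_s=2.0)
bat = strouhal(frequency_hz=8.0, amplitude_m=0.25, speed_m_s=8.0)

print(f"butterfly St = {butterfly:.2f}, bat St = {bat:.2f}")
```

Because the ratio is dimensionless, animals of very different size and speed can be placed on the same scale, which is what makes the comparison in the item possible.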

State of Data #73

#analysis – ‘How good is your data’ for analysis? ‘It points to the “garbage in garbage out” problem. One should always be aware of the potential hazards.’ (Hat tip: Kaiser Fung) The Murky World of Student Loan Statistics: “And yes, the new numbers will show that student-loan debt exceeds credit-card debt.”

#architecture – How Instagram stored hundreds of millions of key-value pairs in Redis – “Fit the data in memory, and ideally within one of the EC2 high-memory types (the 17GB or 34GB, rather than the 68GB instance type)”

#big_data – Choosing the Right Data Storage Solution – for Un/Semi/Structured Data

The all-in-one, new Guide to Data Compression

#DBMS – Serving 1M daily users, 100K DB operations/sec with no Cache – journey from MySQL to Redis

#idea – SaveUp – “Unlike most loyalty programs out there that give you credit based on how much you spend, SaveUp rewards you for how much you save”

Technical papers on optimizer

WhatsUp – Timeline view of the most popular Twitter topics (takes a while to load)