State of Data #107
July 13, 2012
‘Nobody ever got fired for using Hadoop on a cluster’ (Microsoft Research)
“We analyzed 174,000 jobs submitted to a production analytics cluster in Microsoft in a single month in 2011 and found that the median job input data set size was less than 14GB… Facebook jobs follow a power-law distribution with small jobs dominating; from their graphs it appears that at least 90% of the jobs have input sizes under 100 GB. We therefore believe that there are many jobs run on these clusters which are smaller than the memory of a single server.”
Highly insightful paper from Microsoft – how product measurement (e.g., A/B testing) often deceives
7 Startups trying to solve your Big Data Problems
‘Introduction to Data Science’ (free) Book
SQLMap: Automatic SQL Injection tool
How the natural attractiveness of the Normal Distribution leads people to build elusive models for random, ‘Black Swan’ events. Or, why we made large-scale ‘financial crises’ unavoidable.
“Now for an abnormal question: to what extent is normality actually a good statistical description of real-world behaviour? Evidence against has been mounting for well over a century.
In the 1870s, the German statistician Wilhelm Lexis began to develop the first statistical tests for normality. Strikingly, the only series Lexis could find which closely matched the Gaussian distribution was birth rates. The natural world suddenly began to feel a little less normal.”
Everything you wanted to know about Machine Learning in under 30 minutes – a talk from Hilary Mason
‘The talk is geared toward engineers with no prior knowledge of machine learning, and it’s designed to lay out the basic vocabulary and way that we think about the world to provide an amusing foundation. This talk is not an in-depth tutorial.’
The Blue Economy – Visualizing Fishing, Transport, Energy & Cities