State of Data #107
July 13, 2012
Top Read
‘Nobody ever got fired for using Hadoop on a cluster’ (Microsoft Research)
“We analyzed 174,000 jobs submitted to a production analytics cluster in Microsoft in a single month in 2011 and found that the median job input data set size was less than 14GB… Facebook jobs follow a power-law distribution with small jobs dominating; from their graphs it appears that at least 90% of the jobs have input sizes under 100 GB. We therefore believe that there are many jobs run on these clusters which are smaller than the memory of a single server.”
Analysis
Highly insightful paper from Microsoft – how product measurement (e.g., A/B testing) often deceives
Big Data
7 Startups trying to solve your Big Data Problems
Data Science
‘Introduction to Data Science’ (free) Book
DBMS
SQLMap: Automatic SQL Injection tool
Idea
How the natural attractiveness of the Normal Distribution leads people to build elusive models for random, ‘Black Swan’ events. Or, why we made large-scale ‘financial crises’ unavoidable.
“Now for an abnormal question: to what extent is normality actually a good statistical description of real-world behaviour? Evidence against has been mounting for well over a century.
In the 1870s, the German statistician Wilhelm Lexis began to develop the first statistical tests for normality. Strikingly, the only series Lexis could find which closely matched the Gaussian distribution was birth rates. The natural world suddenly began to feel a little less normal.”
Learning
Everything you wanted to know about Machine Learning in under 30 minutes – a talk from Hilary Mason
‘The talk is geared toward engineers with no prior knowledge of machine learning, and it’s designed to lay out the basic vocabulary and way that we think about the world to provide an amusing foundation. This talk is not an in-depth tutorial.’
Visualization
The Blue Economy – Visualizing Fishing, Transport, Energy & Cities
etc
- Soda or Pop? Analyzing Twitter data to answer the question
- Pi Bar, San Francisco special – Single Slice Pizza & Beer, $6.28, Every day between 3:14 and 6:28PM
- ‘Data’ – Singular or Plural? Debate continues!
- 22% of Web Pages reference Facebook