State of Data #78
December 16, 2011
#analysis – “Data Informed, Not Data Driven” – How analytics play a critical role in Facebook design decisions.
Why photo upload “success rate” is so important on Sundays (people do ‘important’ stuff on weekends and upload 150% more pics on Sundays);
‘..very difficult for a set of metrics to fully represent what you value’
#architecture – How to build a data mining web app –
“Well, here is a source code that deploys your app with one command on Google App Engine. You just need to focus on where to get the data (ETL), what to do with it (DM), and how to display it (VISUALIZATION). The source code has example that you can swap with an idea of your own.”
#big_data – What 2012 holds for Big Data – Predictions –
a) Technology – 5 Big Data Predictions for 2012 – Streaming Data Processing
b) Volume – From IDC – “Big Data will earn its place as the next “must have” competency in 2012 as the volume of digital content grows to 2.7 zettabytes (ZB), up 48% from 2011”
c) Effectiveness – From Gartner – “Through 2015, more than 85 percent of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage.”
#Data_Science – Analyzing email to find close friends
‘..took all the e-mail data from an international firm and for one of its offices asked employees to list the people in their social network, dividing the list into friends, colleagues, and acquaintances.
Then Uzzi and Wuchty scanned the workers’ e-mails, for each recording sender and receiver, the time it was sent, and the time it took for the receiver to respond.
The researchers found that both methods—the volume threshold and the response criterion—did a fair job of approximating the social networks the employees had reported themselves.
But then Uzzi and Wuchty tried something new. Instead of looking at the absolute values for volume and response, they looked at the response time.. The new method predicted who was in different employees’ social networks with an accuracy that is several percent higher than the other methods’
#DBMS – Cloud Storage Benchmark (PDF)
#idea – Management by Statistics
“Lots of folks play fast and loose with statistics to make political points. If I told you the United States has lost most of its manufacturing jobs, is that a problem? What if I told you the United States manufactures the most in the world, but manages to do so with the fewest number of people? (Much like how the U.S. produces the most agricultural goods, but uses very few people to do so) Would you still think that is a problem? You could argue this either way, of course, but the point is that the same observable reality can be presented in various ways, thereby slanting the story.”
#learning – How to ‘cook’ with Data – “Same data, same map, different stories”
“As you can see, for each definition of class limits you get a different message. Most people just use equal intervals, but that’s lazy, IMHO. Using equal intervals in a choropleth map is like sorting a bar chart alphabetically. The only thing that is worse than equal intervals is equal intervals plus round numbers.”
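To see how much the class limits alone can change the message, here is a minimal Python sketch that bins the same values two ways: equal intervals and quantiles. The data and the helper names (equal_interval_breaks, quantile_breaks, class_counts) are made up for illustration and are not taken from the linked post.

```python
from statistics import quantiles

def equal_interval_breaks(values, k):
    """Split the range min..max into k classes of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]

def quantile_breaks(values, k):
    """Choose breaks so each class holds roughly the same number of values."""
    return quantiles(values, n=k, method="inclusive")

def class_counts(values, breaks):
    """Count how many values fall into each class defined by the breaks."""
    counts = [0] * (len(breaks) + 1)
    for v in values:
        # A value's class index is the number of breaks it exceeds.
        counts[sum(v > b for b in breaks)] += 1
    return counts

# Hypothetical skewed data, e.g. one value per map region.
values = [1, 2, 2, 3, 4, 5, 8, 13, 40, 100]

equal_counts = class_counts(values, equal_interval_breaks(values, 4))
quantile_counts = class_counts(values, quantile_breaks(values, 4))

print("equal intervals:", equal_counts)
print("quantiles:      ", quantile_counts)
```

With skewed data like this, equal intervals put almost every region in the lowest class, while quantiles spread the regions across all four classes, so the same map would tell two different stories.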
#visualization – What the world searched for in 2011 – Casey Anthony to Fukushima
- !Representativeness Heuristics – “500 died in flight accidents in 2011, & 500 women from pregnancies in the last 12 hours”
- Transportation check-ins at FourSquare
- Tokenizing the entire Web’s English text
- #math Palindrome Equation, valid even when each term is squared: 12 + 43 + 65 + 78 = 87 + 56 + 34 + 21
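
The palindrome equation is easy to check by hand or with a few lines of Python. This is only a quick verification sketch; the variable names are arbitrary.

```python
left = [12, 43, 65, 78]

# The right-hand side is the left-hand side read backwards:
# reverse the order of the terms and the digits of each term.
right = [int(str(n)[::-1]) for n in reversed(left)]

sum_left, sum_right = sum(left), sum(right)
sq_left = sum(n * n for n in left)
sq_right = sum(n * n for n in right)

print(right)                 # [87, 56, 34, 21]
print(sum_left, sum_right)   # 198 198
print(sq_left, sq_right)     # 12302 12302
```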