State of Data Last Week – Nov 20

No SoD newsletter next week. Happy Thanksgiving, Everyone!

#Analysis – Signups increased by 60% after removing the signup form! Yet another KSU (Keystrokes per Sign-up) victory!

#Architecture – RIP Web SQL Database – the W3C Working Group killed it this week :(

#Big Data – Secrets of LinkedIn Data Scientists

#DBMS – What happens when you try to UPDATE a record with EXACTLY the same values as before? Is it wasted work? Bad news – the actual record does get (redundantly) modified; good news – the index doesn’t.
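
If the redundant write bothers you, one common workaround is to add a predicate that skips rows already holding the target value. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine; the `users` table and its data are made up, and what happens inside the storage layer will differ from one DBMS to another.

```python
import sqlite3

# Hypothetical table and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO users (id, city) VALUES (1, 'Reno')")

# Unguarded UPDATE: the row matches the WHERE clause, so SQLite counts
# it as modified even though the value is unchanged.
unguarded = conn.execute(
    "UPDATE users SET city = ? WHERE id = ?", ("Reno", 1)
).rowcount

# Guarded UPDATE: the extra predicate filters out rows that already hold
# the target value, so no row is touched. IS NOT is SQLite's NULL-safe
# inequality operator.
guarded = conn.execute(
    "UPDATE users SET city = ? WHERE id = ? AND city IS NOT ?",
    ("Reno", 1, "Reno"),
).rowcount

print(unguarded, guarded)
```

The guard costs one extra comparison per candidate row, which is usually cheaper than a write nobody needed.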

#learning – Least scary way of learning Ruby (with Zombies). It requires registering with an email address. Nice material (even the technical parts!).

#visualization – Mozilla Data Visualization contest is now open with data sets.

How to Tell Stories with Data is an aggregation of tools and methods. Most slides are online now.


  • Tarot card readers – or life insurers – could tell how long you will live just by looking at the cookies in your browser?
  • The housing crash was (also) blamed on a ‘dramatic fall in inter-state migration rate’. This analysis (PDF) from the Fed now shows the fall is not that dramatic; it resulted from the wrong way of compensating for missing data. Data quality, anyone?
  • How to show off big numbers – ‘annual mileage of U-Haul trucks, trailers and tow dollies would move a family to the moon and back more than 20 times per day, every day of the year’. Tow dollies! Seriously??
  • Mind-boggling soccer free kicks – why are they so difficult for even expert goalkeepers when taken from far out? The ball takes its sharpest spin when it slows down the most. Pretty much the same for baseball.
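
For the curious, the U-Haul moon comparison above is easy to turn back into a plain number. A rough sketch, assuming an average Earth-Moon distance of about 238,855 miles (my assumption; the quote gives no distance) and treating "more than 20 times per day" as exactly 20, so the result is a lower bound of roughly 3.5 billion miles a year:

```python
# Assumption: average Earth-Moon distance of about 238,855 miles
# (not stated in the quote).
EARTH_MOON_MILES = 238_855
ROUND_TRIPS_PER_DAY = 20   # "more than 20 times per day", so a lower bound
DAYS_PER_YEAR = 365

# "To the moon and back" is one round trip.
round_trip_miles = 2 * EARTH_MOON_MILES
annual_miles = round_trip_miles * ROUND_TRIPS_PER_DAY * DAYS_PER_YEAR

print(f"{annual_miles:,} miles per year (lower bound)")
```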

State of Data Last Week – Nov 13

<Analysis> To analyze A/B testing, the trick is to separate out the conversions that would’ve happened anyway. Here’s an intuitive way to do so.
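
The simplest version of that idea, and not necessarily the linked article's method, is to use the control group's conversion rate as the "would have happened anyway" baseline and subtract it from the variant. A minimal sketch with a hypothetical function name and made-up visitor counts:

```python
def incremental_conversions(control_visitors, control_conversions,
                            variant_visitors, variant_conversions):
    """Estimate conversions attributable to the variant beyond the baseline."""
    baseline_rate = control_conversions / control_visitors
    variant_rate = variant_conversions / variant_visitors
    # Conversions the variant group would likely have produced anyway,
    # assuming it behaves like the control group absent the change.
    expected_anyway = baseline_rate * variant_visitors
    return {
        "baseline_rate": baseline_rate,
        "variant_rate": variant_rate,
        "expected_anyway": expected_anyway,
        "incremental": variant_conversions - expected_anyway,
    }


# Made-up numbers, purely for illustration.
result = incremental_conversions(
    control_visitors=1000, control_conversions=50,
    variant_visitors=1000, variant_conversions=65,
)
print(result)
```

This says nothing about statistical significance; it only separates the baseline from the lift.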

Google Refine 2.0 – a great BI tool for “Data Wranglers”?


<Big Data> Yet another one on Facebook – how they deal with 13M queries per sec. Clue – they derived a distributed replacement for the otherwise efficient “Stored Proc”. (Still, 4ms latency for I/O sounds way higher than it should be.)

Ever wondered why Oracle decides to first aggregate records and then merge with other tables for certain queries, and does the exact opposite for others? The Oracle Optimizer Development team explains the algorithm.

Google Fellow Jeff Dean talks to Stanford EE380 class. “That final feature will drive you over the edge of complexity”.

> What New Yorkers complain about most during a day. Complaints about graffiti reach their peak at 2PM.


  • Here’s one more thing IRS data can expose – Missing Children.
  • Counter-intuitive analysis of the week – the Dilbert creator explains why eating breakfast is overrated and how his creativity peaked while eating about 400 calories over 16 hrs. Apparently, the optimal time for intelligence, in relation to hunger, is 4 hrs.

State of Data Last Week – Nov 06


<Inside Intuit> Database Engineer (10+ years of experience) position open in EMS (Reno, Nevada)

Director, Data Offerings position open in BIO (Mountain View, CA)


<Analysis> Expedia removed an optional field (Company) from its “Buy Now” page – the field had been costing $12M in profit a year.

<Architecture> When we like a pastry shop – Yelp uses Amazon Elastic MapReduce (EMR) to analyze its data (100GB/day) using mrjob – a Python framework for writing MapReduce jobs. Yelp took down its in-house Hadoop clusters in May 2010.
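
For anyone who has not written a MapReduce job before, the shape is a mapper that emits key/value pairs and a reducer that combines the values for each key. The sketch below imitates that shape in plain Python only; it does not use mrjob or EMR, and the log format, business IDs and function names are all made up.

```python
from collections import defaultdict

# Hypothetical log lines in the form "<business_id>\t<event>".
LOG_LINES = [
    "pastry-shop-1\tview",
    "pastry-shop-1\treview",
    "taqueria-7\tview",
    "pastry-shop-1\tview",
]


def mapper(line):
    """Emit (business_id, 1) for every log line."""
    business_id, _event = line.split("\t")
    yield business_id, 1


def reducer(business_id, counts):
    """Sum the counts collected for one business."""
    yield business_id, sum(counts)


def run_job(lines):
    """Run map, group-by-key (shuffle) and reduce in a single process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


totals = run_job(LOG_LINES)
print(totals)
```

A real framework spreads the map and reduce steps across machines; the grouping step in the middle is what the cluster does for you.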

<Big Data> A list of references for mining streaming data – map-reduce is not that great for streaming / non-stored data, as the user does not know beforehand what data, or how much of it, is to be analyzed. Yahoo’s S4 is quickly getting popular as a distributed stream computing platform.
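
To see why "not knowing how much data is coming" calls for different algorithms, consider reservoir sampling, a classic one-pass technique that keeps a uniform random sample of k items from a stream of unknown length. This is a generic sketch (Algorithm R), not something taken from the linked references or from S4; the function name and the event stream are made up.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1) by
            # drawing a slot in [0, index] and replacing only if it
            # falls inside the reservoir.
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


# A generator stands in for a stream whose length is not known up front.
events = (f"event-{i}" for i in range(10_000))
sample = reservoir_sample(events, k=5, rng=random.Random(42))
print(sample)
```

Memory stays at k items no matter how long the stream runs, which is the property batch-style processing cannot give you.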

<DBMS> InnoDB – the faster storage engine for MySQL – is no longer available for free in “Classic MySQL” :(

<Learning> Adrian Cockcroft (Netflix Performance Architect; ex-Sun) wrote an amazing set of articles comparing NoSQL availability models, and What Netflix needs from NoSQL.

<Visualization> The best graphical analysis of the 2010 election, in 10 visuals, comes from the New York Times.


  • How long will search be the king? Twice as many people in the 18-29 age group discover a product or service through a social network (Facebook) compared to consumers across all age groups.
  • Facebook profile photo angled at 15°? Fun-loving. 16°? Uh-oh. Risky business – Fast Company analyzes.
  • Can mod_pagespeed from Google really speed up your site 2x? Here’s a quick way to find out from another proxy.
  • Stats on P2P file sharing – Largest – 746GB (all 2010 World Cup Soccer matches; ~6GB per 45 min); Oldest – The Matrix ASCII; Most Data Transferred by a single torrent – 15.77PB (StarCraft 2)