State of Data Last Week – Oct 30

<Analysis> Here’s one way to buy the newest Mac ‘netbook’ or holiday gifts without raiding the savings account: win a quick $25K by predicting the number and type of newspapers / magazines Hearst should place on different newsstands. All the data needed is there.

<Architecture> Marrying memcached and noSQL – GigaSpaces now has memcached support.

<Big Data> “noSQL took away relational model and gave nothing back”? Very interesting thread (real developers’ feedback, with some religious war-flames too) launched after the ‘noSQL evening’ in Palo Alto.

<DBMS> Things we can learn from StackOverflow’s Data Center Migration & Server upgrade – (a) even if you move up to *much faster* processors, default BIOS settings may yield lower clock speeds than before; (b) StackOverflow’s full text query takes about 335ms.

<Learning> Palo Alto – How Facebook uses MySQL to serve 0.5B users – Tech Talk on Nov 2 – 7 to 9PM. It will also stream live!

<Visualization> An experiment in visualizing data using a hand and long exposure – German population pyramid 1994-2009.



State of Data Last Week – Oct 23

<Analysis> Have diminishing returns set in for investments in higher education? A somewhat counter-intuitive analysis of BLS data suggests so.

Nasdaq releases on-demand historical stock data – nice way to test trading algorithms with accurate, bulk data.

<Architecture> Michael Stonebraker clarifies his position – the CAP theorem should not be a justification to give up on ACID. Also, “rm -rf” type errors cannot be recovered from in CAP theorem terms.

<Big Data> Showing off? The average Hadoop cluster has 66 nodes and 114TB of data. eBay has 8,500 nodes and 2PB.

<DBMS> MySQL cluster delivers 180K primary key SELECTs per sec and 120K UPDATEs per sec on just 2 nodes.

Cary Millsap continues his “Thinking Clearly about Performance” in ACM – should you open the window (global stuff) or take off your heavy sweater (local)?

<Visualization> Hipmunk visuals for Airline reservation – available flights are aggregated over a single chart, and default-sorted by “agony” (price; duration; layover; red-eye).


  • Two sides of reference – Wikipedia puts ZERO tracking files on your computer; puts 234 cookies / beacons
  • The REST service API is steadily winning the API war over SOAP. 74% of the 2,000 most popular Web APIs are now REST-based.
  • Why a house numbered < 31 (or with digits adding up to 6) could sell earlier – many buyers unwittingly get ‘nudged’ to live in a house numbered after a birthday / wedding anniversary.
  • Using auto-increment / serial numbers for entities is a standard practice in modeling. It can also give away a lot of secrets, such as letting outsiders estimate iPhone sales, or helping statisticians win (real) wars.
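The serial-number leak in the last bullet is usually known as the "German tank problem". Here is a minimal sketch of the standard estimator, assuming the serials run 1..N without gaps and the observed ones are a random sample drawn without replacement. The function name and the sample values are made up for illustration; real numbering schemes may have gaps or prefixes that break these assumptions.

```python
def estimate_total(serials):
    """Estimate the total count N from a sample of observed serial numbers.

    Uses the classic estimator N ~= m + m/k - 1, where m is the largest
    serial seen and k is the sample size. Assumes serials are 1..N with
    no gaps (an assumption, not something the source states).
    """
    k = len(serials)
    if k == 0:
        raise ValueError("need at least one observed serial number")
    m = max(serials)
    return m + m / k - 1


if __name__ == "__main__":
    observed = [19, 40, 42, 60]  # hypothetical observed serial numbers
    print(estimate_total(observed))  # 74.0
```

The intuition: the largest serial seen is a lower bound, and the average gap between observed serials (m/k - 1) is added on top as the likely unseen remainder.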

State of Data Last Week – Oct 16

<Cool Numbers> When would we rather type than speak to search (> 6 words); Teens would rather SMS and ask their friends!

  • If a search query is more than 6 words (PDF), we are dramatically less likely to use “voice search” on mobile devices, irrespective of keyboard size. Our short-term memory can only hold “The Magical Number Seven, Plus or Minus Two” items.
  • Adult Content, Lifestyles and Health are least voice-searched categories. No one would typically like to broadcast those; unlike “food and drink” searches (most popular category).
  • American teens send or receive 3,339 texts a month (about 1 SMS every 10 waking minutes). Teen females send 60% more SMS than teen males. They are also talking a lot over wireless – only adults over 55 talk less than teens on mobile.
  • IMDB (along with BlockBuster; AOL; Domain name registration) turned 25 this year. Interesting numbers from its database – there are 62% more actors than actresses; almost 17,000 movie industry people cannot be connected to Kevin Bacon (infinite Bacon Number).
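A quick sanity check of the texting figure above. This is only a back-of-the-envelope sketch: the 30-day month and 16 waking hours per day are assumptions, not numbers from the survey.

```python
def minutes_per_text(texts_per_month, days_per_month=30, waking_hours_per_day=16):
    """Average waking minutes between texts (sent or received)."""
    waking_minutes = days_per_month * waking_hours_per_day * 60
    return waking_minutes / texts_per_month


if __name__ == "__main__":
    # 3,339 texts a month is the figure quoted in the bullet above.
    print(round(minutes_per_text(3339), 1))  # 8.6
```

Under these assumptions it works out to one text roughly every 8.6 waking minutes; with a longer waking day the figure moves closer to the quoted 10.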

“Google Price Index” – Daily measure of inflation from web shopping data.

Better than Myers-Briggs? Plug in any Twitter handle here to see the emotional analysis.

<Inside Intuit> Refer great data people you know to work with us –

  1. Data Warehouse Engineer @ San Diego
  2. Data Engineer @ Menlo Park

Netflix’s move to Amazon Cloud / SimpleDB, mainly for HA, is now fully documented. Interesting use cases were solved – e.g., the customer’s mobile phone was used as the primary key to avoid sequences. All constraints were removed.


<Big Data> What’s your unit’s PVS (PageView:Server) ratio? Digg has 400K page views per server; StackOverflow 12M.

Some of the world’s top SQL / Oracle Tuning experts + 4*1.5 hrs + Web + $375 = Highly Recommended “Virtual Conference on Oracle Performance” on Nov 18, 19.

What’s new in MySQL 5.5 Data Replication

Data for Date – Real-time Guy:Girl ratio in bars and other places. Many more girls than guys check in at indie movie theaters.
<Cocktail party cheat-sheet> Kindle owners are wealthier than iPad owners. They’re more educated too!

State of Data Last Week – Oct 09

<Cool Numbers> Decline in marriage, can “Perfect 10” date help?

  • For the first time in US history, ‘Never Married’ exceeds ‘Married’ among young adults (the 25-34 age group), per US Census data.
  • For certain small business owners (Florist, Caterer, Wedding Planner) average invoices are way higher today (10/10/10). Interestingly, 10x as many couples (39,000) are tying the knot compared to the same Sunday last year. Next surge – 11/11/11.
  • Iowa State University thinks a murder costs society $17.25M – a commenter disputes it.
  • Why Data Services are not so easy – each tweet (140 characters) typically becomes 1000 bytes when accessed over the detailed (‘FireHose’) API.

Why Macy’s loaded its Atlanta stores with hats

World Bank Data Challenge.

<Inside Intuit> Refer great data people you know to work with us –

  1. Director, Data Strategy (IFS)
  2. Statistician / Data Analyst (Payments)

<Strategy/Arch> What a ‘Data Scientist’ does – Obtain; Scrub; Explore; Model; and Interpret.

Sometimes all we need to do is ‘Duct-tape Architecture’ – just scaling it to a point beyond breakage. Great example here of scaling with SSDs.

<Big Data> eBay throws out its 6.5-petabyte Greenplum, moves the data into Teradata.

<Schema> How to do MapReduce with good old Oracle database

<DBMS> Impending MySQL price increases.

<Visualization> Visualizing source-code – how Apache differed from PostgreSQL

<Cocktail party cheat-sheet> The Foursquare outage happened because one of the shards suddenly had more data (67GB) than RAM (only 66GB, Amazon EC2 virtual). Apparently, MongoDB barfs if it has to read from disk, even for little ‘I checked into Restaurant X’ details (each about 300 bytes).
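The arithmetic behind that outage is simple enough to monitor for. Below is a rough sketch of such a check, not a MongoDB API: the function names, the shard names, and the 50GB figure for the healthy shard are hypothetical, while the 67GB, 66GB, and 300-byte numbers come from the item above.

```python
def shards_over_memory(shard_data_gb, ram_gb):
    """Return {shard_name: overflow_gb} for shards whose data exceeds RAM."""
    return {
        name: size - ram_gb
        for name, size in shard_data_gb.items()
        if size > ram_gb
    }


def records_in_overflow(overflow_gb, record_bytes=300):
    """How many records of the given size fit in the overflow (1GB = 1024**3 bytes)."""
    return int(overflow_gb * 1024**3) // record_bytes


if __name__ == "__main__":
    shards = {"shard-a": 50, "shard-b": 67}  # hypothetical names, sizes in GB
    over = shards_over_memory(shards, ram_gb=66)
    print(over)  # {'shard-b': 1}
    for name, overflow in over.items():
        print(name, records_in_overflow(overflow), "check-ins no longer fit in RAM")
```

Even a 1GB overflow corresponds to millions of 300-byte check-ins, which is why a small imbalance between shards was enough to push reads to disk.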


State of Data Last Week – Oct 02

<Cool Numbers> Data can ban grunting in tennis; Are IMDB female users far less tolerant of bad movies; $ value for 0.1% increase in employee engagement.

  • Female IMDB users are far less tolerant of bad movies than male users. For “Top 50” movies, the male:female rater ratio is about 5:1. For “Bottom 10” movies, the male:female rater ratio decreases to 3:2.
  • Opponents of tennis players who grunt during serve (e.g., Maria Sharapova – 100 decibels) respond significantly slower (by 21-33ms). In professional tennis, this delay translates to the ball travelling 2 extra feet before the opponent can respond.
  • Talent Analytics – Starbucks and Best Buy can apparently identify the value of a 0.1% increase in engagement among employees at a particular store. At Best Buy, for example, that value is more than $100,000 in the store’s annual operating income.
  • Measured by unique visitors, PayPal was the #1 US FI (Financial Institution) in August. It was visited by roughly 5 million more unique visitors than the 2nd (Chase).

<Analysis> How to identify and stay clear of “Faux Marketing Metrics” serving no real purpose (including, maybe, in this newsletter).

<Strategy/Arch> The worst metaphor in Cloud Computing is apparently “Cloud in a box” (Thanks Oracle!). NoSQL is #9, and “Cloud Computing” itself is at #15.

<Big Data> LinkedIn analytics team looks under the hood of Signal (heavily using Lucene) – social search for LinkedIn-Twitter accounts.

<Schema> How Google did incremental real-time search with “Percolator” (on BigTable) and reduced the average age of documents by 50% – excellent white paper by Googlers from USENIX 2010. Percolator is shown to be 1000x faster than traditional MapReduce.

<DBMS> How Oracle RAC handles contention when one node tries to read the same block another node is writing at the same time – the “gc buffer busy waits” and the tactics to address them are explained really well.

The very sub-optimal way to convert an IP address to an integer in MySQL (and to write it in a book!) is shown here.
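For reference, the conversion itself is just four octets packed into a 32-bit number; MySQL ships INET_ATON() and INET_NTOA() for it. The sketch below shows the same idea in Python for IPv4 only. It does not reproduce the book's query or the linked post, and the function names are made up for illustration.

```python
import ipaddress


def ip_to_int(address):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


def int_to_ip(value):
    """Convert a 32-bit integer back to a dotted-quad IPv4 string."""
    return str(ipaddress.IPv4Address(value))


def ip_to_int_manual(address):
    """Same conversion spelled out: shift each octet into place."""
    result = 0
    for octet in address.split("."):
        result = (result << 8) | int(octet)
    return result


if __name__ == "__main__":
    print(ip_to_int("192.168.1.1"))         # 3232235777
    print(int_to_ip(3232235777))            # 192.168.1.1
    print(ip_to_int_manual("192.168.1.1"))  # 3232235777
```

Storing the integer rather than the string keeps the column small and makes range lookups (e.g., "is this address inside that block?") a plain numeric comparison.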

<Visualization> Crappy iPhone signal is not a problem anymore. Bring Chuck Norris closer to the iPhone, all bars will light up instantly!

<Cocktail party cheat-sheet> Thinking of adding indexes defensively? 31 indexes on a 340 column table could slow down inserts 8x.
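To get a feel for the effect on your own machine, here is a small experiment sketch. It uses an in-memory SQLite database from Python's standard library rather than whatever database the linked test used, and a much smaller table than 340 columns, so it will not reproduce the 8x figure; the function name and table layout are made up for illustration.

```python
import sqlite3
import time


def time_inserts(num_indexes, num_columns=40, num_rows=2000):
    """Insert num_rows rows into a table with num_indexes single-column indexes.

    Returns (elapsed_seconds, row_count). Timings vary by machine.
    """
    if num_indexes > num_columns:
        raise ValueError("this sketch builds at most one index per column")
    conn = sqlite3.connect(":memory:")
    cols = [f"c{i}" for i in range(num_columns)]
    conn.execute(f"CREATE TABLE t ({', '.join(c + ' INTEGER' for c in cols)})")
    for i in range(num_indexes):
        conn.execute(f"CREATE INDEX idx_{i} ON t ({cols[i]})")
    placeholders = ", ".join("?" for _ in cols)
    # Deterministic pseudo-data so runs are comparable.
    rows = [
        tuple((r * 31 + c * 17) % 1000 for c in range(num_columns))
        for r in range(num_rows)
    ]
    start = time.perf_counter()
    with conn:  # commit all inserts as one transaction
        conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, count


if __name__ == "__main__":
    for n in (0, 10, 31):
        elapsed, count = time_inserts(n)
        print(f"{n:2d} indexes: {count} rows in {elapsed:.4f}s")
```

Every extra index is one more structure to update per inserted row, so insert time generally grows with the index count; how steeply depends on the engine, the data, and the hardware.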