State of Data – #46

#analysis – Amazon’s latest letter to shareholders (PDF) – from Jeff Bezos – mentions “data” 13 times and starts as if Amazon were a data company:

‘Random forests, naïve Bayesian estimators, RESTful services, gossip protocols, eventual consistency, data sharding, anti-entropy, Byzantine quorum, erasure coding, vector clocks … walk into certain Amazon meetings, and you may momentarily think you’ve stumbled into a computer science lecture.’

#architecture –  
22 Free Tools (e.g., Data Wrangler, Google Refine) for Data Analysis & Visualization – notice how the browser has become the default IDE for data.


#big_data – How Orbitz uses Hadoop to complement its Enterprise Data Warehouse Strategy – presented at the Chicago Data Summit on 04/26

#DBMS – Core vs. Thread – Craig Shallahamer was intrigued by ‘a couple examples where database servers have exhibited more power than the number of CPU cores could provide, which implies threads are indeed providing some extra power…so some investigation seems warranted’

#learning – Visual Gotcha – Why bridges look totally non-realistic on Google Earth – “The images are the result of mapping a 2-dimensional image onto a 3-dimensional surface. Basically, the satellite images are flat representations in which you only see the topmost object—in this case you see the bridge, and not the landmass or water below the bridge. However, the 3D models in Google Earth contain only the information for the terrain–the landmass or the bottom of the ocean.”

#visualization –
The grand winner of ‘Visualize Your Taxes’ (mentioned here earlier) lets you enter your earnings and see where your tax money is being spent!



State of Data – #45

#analysis – Data Analysis in action: ‘Yahoo Search Revenue Disaster’

#architecture – ‘Metrics Everywhere’ (entertaining PDF) from Yammer folks – on how to make better decisions using numbers. Ways to measure – Gauges (# of cities), Counters (# of open connections), Meters (# of req/sec), Histograms (# of cities returned – percentile), Timers (# of ms to respond); and Vitter’s Algorithm.
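To make the histogram idea concrete, here is a minimal sketch of reservoir sampling, assuming “Vitter’s Algorithm” refers to Vitter’s Algorithm R. It keeps a fixed-size uniform sample of a stream, so percentiles can be estimated without storing every value. The function names and the simple nearest-rank percentile are illustrative choices, not taken from the Yammer slides.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

def percentile(sample, p):
    """Nearest-rank style percentile of a non-empty sample (p in 0..100)."""
    if not sample:
        raise ValueError("sample is empty")
    ordered = sorted(sample)
    index = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[index]

# Example: estimate the 95th percentile of 10,000 made-up response times.
source = random.Random(1)
latencies = (source.expovariate(1 / 50) for _ in range(10_000))
sample = reservoir_sample(latencies, k=1000, rng=random.Random(2))
print(percentile(sample, 95))
```

The memory cost stays at k values no matter how long the stream runs, which is the property that makes this approach attractive for long-lived metrics.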

#big_data – IEEE VAST 2011 challenge – three mini-challenges on Epidemic Spread, Cybersecurity, Text Analytics aggregating up to a ‘Grand Challenge’ that combines all data sets.

#DBMS – Here’s a pretty good NoSQL “book” or compendium (PDF) from folks at Stuttgart University, Germany – from Basic Concepts to a detailed comparison between MongoDB / CouchDB / Dynamo / Cassandra etc.

#learning – ‘Split-Apply-Combine’ strategy for Data Analysis (PDF) ‘where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together’. [Journal of Statistical Software]
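A minimal sketch of the pattern in plain Python, assuming a list of dictionaries as input. The helper name `split_apply_combine` and the sample rows are made up for illustration; the paper itself describes the strategy in the context of R.

```python
from collections import defaultdict
from statistics import mean

def split_apply_combine(rows, key, apply):
    """Split rows into groups by key, apply a function to each group, combine into a dict."""
    groups = defaultdict(list)
    for row in rows:                      # split
        groups[key(row)].append(row)
    return {group_key: apply(group)       # apply + combine
            for group_key, group in groups.items()}

# Made-up sample data: star ratings per book.
reviews = [
    {"book": "A", "stars": 5},
    {"book": "A", "stars": 3},
    {"book": "B", "stars": 1},
]

average_stars = split_apply_combine(
    reviews,
    key=lambda row: row["book"],
    apply=lambda group: mean(row["stars"] for row in group),
)
print(average_stars)
```

Because each group is processed independently, the apply step is also the natural place to parallelize.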

#visualization –
How Switzerland’s Federal Statistics Office changed the game in Census – ‘only a small population is now surveyed by phone or in person’



State of Technology – #7

#at_other_places –

#architecture – 
What exactly is a “thread” in a CPU – Ars answers well

#code – 
If you’re interested in application performance, this is a must-read – Dynatrace folks explain the White Box Testing Best Practices for Regression and Scalability Analysis

#design – 
Better than books, or even schools – “Design for Startups” is a 24-year-old’s evolving record of his growing awareness of design. Great, great resource.

#essay – 
This time it is not just a big article, but 50 fascinating things someone at read recently. E.g., “Groupon saw its revenue surge to $760 million last year from $33 million the previous year.”

#saas – It’s always worth asking this question, since we are never more than one release away from gnarly problems – ‘What we did wrong’ – NPR on its API architecture

#social – Why the password ‘this is fun’ is far more secure than a six-character mixed password with a special character
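A rough back-of-the-envelope sketch of why length wins, assuming a pure brute-force attacker who tries every string over a fixed alphabet. The alphabet sizes (95 printable ASCII characters, 27 for lowercase letters plus space) are assumptions for illustration. An attacker who guesses whole dictionary words would face a much smaller search space for the phrase, so treat these numbers as an upper bound.

```python
import math

def brute_force_bits(length, alphabet_size):
    """Bits of search space for an exhaustive search over all strings of this length."""
    return length * math.log2(alphabet_size)

# Six characters drawn from the 95 printable ASCII characters.
short_mixed_bits = brute_force_bits(6, 95)

# 'this is fun': 11 characters drawn from lowercase letters plus space.
phrase_bits = brute_force_bits(len("this is fun"), 27)

print(f"6 mixed characters: {short_mixed_bits:.1f} bits")
print(f"'this is fun':      {phrase_bits:.1f} bits")
print(f"the phrase is about 2^{phrase_bits - short_mixed_bits:.0f} times larger to search")
```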

#tool – Some really clever CSS tools and techniques, personal favorite is CSS3 Keyboard.

#tweaks n’ hacks – 
Too busy to get out of Unix? We now have “Google Shell”


#parting_thought – We all knew it! Surprised about a bad review (or, decision)? Just try it after the reviewer had lunch.

State of Data – #44

#analysis – ICWSM 2011 published the list of accepted papers. ‘4chan..Analysis of Large Online Community’ (PDF) is interesting for Large Data Practitioners. Folks from MIT analyzed over 5M posts to ‘quantify ephemerality’. The median life of a thread is ~4 min; the longest-lived thread in the sample was alive for only 6.2 hrs (compare that with identity-enabled sites). The authors found that ‘anonymity promotes disinhibition, mob-behavior’ but that the disinhibition worked better in ‘advice and discussion threads’. NSFW language warning: the paper quotes content verbatim from /b/ on 4chan.

#architecture – Take a sneak peek inside World’s 10 largest Data Centers


#big_data – Visualizing News Data for Defense Research & Intelligence Analysis – ‘take terabytes of data from 5000 sources and make it actionable’ (using this editor’s favorite viz tool, Spotfire) – nice 46 min presentation with Q&A later

#DBMS – Running Red Hat, Oracle, and new Xeon processors? You may get about 10% better performance by enabling Turbo Boost

#learning – Automated Processing of WikiLeaks cables showing friends (green dots), foes (red), and passersby (teal and blue) – original Stanford Class Project here (PDF). They found Spain to be the US’s most important ally

#visualization –
Beauty of Map – the entire BBC series is now available to watch



State of Technology – #6

#at_other_places –

#architecture –
How a ‘rapidly growing site’ – Posterous – optimizes cache performance

#code –
What are the best CS books for ‘self-taught’ programmers? Isn’t it amazing to realize one can finish the entire MIT courseware, free, right from home? Set yourself a timetable and weekly schedule, just like in uni (i.e., every Saturday plus Tuesday and Thursday nights), and work through it.

#design –
“Design Essentials for Developers” (PPT) was one of the highest-rated presentations at the just-concluded Web 2.0 Expo in SF

#essay –
How Great Entrepreneurs Think – researchers arrive at ‘effectual reasoning’ – ability to brilliantly improvise without setting concrete goals. In contrast, corporate executives use ‘causal reasoning’ – start from a fixed goal.

#mobile – Mobile Boilerplate is a ‘Best Practice Baseline for your Mobile App’ with “wow factors” like HTML5 offline caching, cross-platform compatibility

#tweaks n’ hacks – A “Binary Low Table” for your living room?


#parting_thought – ‘The new media and technologies by which we amplify and extend ourselves constitute a huge collective surgery carried out on the social body with complete disregard for antiseptics.’ – Marshall McLuhan


State of Data Last Week – #43

#analysis – ‘How I am doing Compared to Other Companies’ (PDF); Cindy Alvarez from KISSMetrics presented anonymized data and insights on its customer base at Web 2.0 Expo 2011. E.g., the median SaaS subscription conversion was about 2%; the highest was 9%, during Feb 2011

#architecture – Data Liberation Front!
Every ecosystem should ideally embrace this in ‘Data Strategy’. Focused effort, itemized list on ‘how to liberate your data from (or to!) any Google Product’.

#big_data – How to use data to time your tweets and engage the most customers if you are a business: ‘tweet late, email early and don’t forget about Saturday’. The re-tweet sweet spot is around 4PM. ‘The Science of Timing’ from HubSpot mines hundreds of millions of tweets.

#DBMS – (From our Solid-State analyst friend Wes Brown) Solid-State Storage Deep-Dive (PPT) is by far the most succinct presentation covering different SSD dimensions – Read vs. Write; NOR vs. NAND; SLC vs. MLC. It was recently presented at ‘SQL Saturday’

#learning –
Data Visualization for Web Designers was another good talk at Web 2.0 Expo SF 2011 by the folks behind Oakland Crimespotting.

#visualization –
700 BILLION minutes spent on Facebook a month; 73 items are ordered from Amazon a second; 1.3 ExaBytes of data exchanged by Mobile Internet users – great visual from Good Magazine


  • Why Data is the next big hope for Asthma sufferers (Economist, Apr 7)
  • Winner of Data 2.0’s ‘Startup Pitch’ was Micello – “Google Maps for Indoors”. It offers indoor maps for malls, stadiums, airports etc. (Data 2.0)
  • Europe’s Biggest Ever ‘Public Data Challenge’ launched, 20,000 to be won
  • Again, correlation is not causation. ‘68 percent of people who understand HTML prefer nonfiction, compared with 48 percent of people in general’ – One Correlation a day

‘Sentiment Analysis’ – How to deal with Systemic Rebellion?

“Sentiment Analysis” is the fancy technical name for summarizing subjective group opinion about something into, usually, a quantitative rating. For example, how can we tell from Yelp reviews whether people love “French Laundry” more than “Chef Liu”? Obvious sources for such analysis are social networks, Amazon/Netflix/Yelp reviews, TechCrunch comments etc.

A while ago, after analyzing 1000+ book reviews sorted by date, I came to realize that books tend to get higher, more positive reviews near their release. The better known the author, the better the immediate reviews seem to be. This makes intuitive sense. Most early readers have waited a long time to read, say, the newest Harry Potter. Some of them also fall under the spell of the “sunk cost fallacy”, much as you suddenly notice more Fords on the street right after you buy one. That investment of time, money and emotion often leads readers to overlook any weakness in a popular author’s new book.

However, this model of mine was just turned upside down. I was eagerly waiting for a new book by one of my favorite fiction authors. It was released today and should arrive from Amazon any moment. I casually went to check the Amazon reviews; the author’s books typically sit at 5 stars in the first week (same for James Patterson too!). Except this time, the reviews were scathing. 35 out of 44 reviewers gave it ONE star, and none of them for the story or the writing. It seems Amazon was charging more for the Kindle version than for the hardcover. Users came together in the “Discussion Forum” and decided to send a clear and consistent message by drubbing the book. Some of the reviewers in fact apologized to the author for posting bad reviews without even reading the book.

So, how could I handle this challenge in ‘Sentiment Analysis’? Normalizing the “extreme reviews” that share a consistently similar tonal score (a rant against Amazon, in this case) looks like one possible way. Checking the seasonality of reviews and feedback (are they unnaturally clustered around a time frame?) is another. Linking to other ‘networks’ (in this case, the forum data) for the same identities could be yet another. A very interesting challenge indeed.
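A minimal sketch of the first two ideas, under loud assumptions: each review is a dictionary with a star rating, some text and a day index, and a crude keyword match stands in for a real tonal score. The keyword list, the thresholds and the sample data are all made up for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical keyword list: terms suggesting a review is about pricing, not the book.
OFF_TOPIC_TERMS = ("price", "kindle", "hardcover")

def is_off_topic(text):
    """Crude stand-in for a tonal score: does the review mention a pricing term?"""
    lowered = text.lower()
    return any(term in lowered for term in OFF_TOPIC_TERMS)

def adjusted_rating(reviews):
    """Average stars after dropping 1-star reviews that look off-topic."""
    kept = [r["stars"] for r in reviews
            if not (r["stars"] == 1 and is_off_topic(r["text"]))]
    return mean(kept) if kept else None

def burst_days(reviews, min_reviews=5, min_share=0.7):
    """Days on which 1-star reviews make up an unusually large share of all reviews."""
    total = Counter(r["day"] for r in reviews)
    one_star = Counter(r["day"] for r in reviews if r["stars"] == 1)
    return sorted(day for day, n in total.items()
                  if n >= min_reviews and one_star[day] / n >= min_share)

# Made-up sample data; "day" counts days since release.
sample_reviews = (
    [{"day": 0, "stars": 1, "text": "Kindle price is higher than the hardcover!"}] * 4
    + [{"day": 0, "stars": 5, "text": "Great story."},
       {"day": 1, "stars": 4, "text": "Solid plot."}]
)

print(adjusted_rating(sample_reviews))  # average without the pricing protest
print(burst_days(sample_reviews))       # days that look like a coordinated burst
```

Dropping reviews outright is the bluntest option; down-weighting them, or reporting the raw and adjusted ratings side by side, would lose less information.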

At the very least, I would be more careful before stating ‘earlier book reviews are always more forgiving’ in public from now on!