State of Data this year – 2010

There will be no newsletter on Dec 24 and Dec 31. The last issue of the year is thus a ‘triple dose’. Happy Holidays!

This could be a dream job for someone passionate about data – Intuit is looking for a Data Scientist; please tell your best data friend about it!


  1. Expedia removed an optional field (Company) from its “Buy Now” page – the field had been costing it $12M in profit a year.
  2. Why you should look beyond the typical “Insights from Top 10” and adopt “Weighted Sort”. Avinash is also the author of two of the best-selling (and ultra great) books ever on Web Analytics.
  3. LinkedIn analytics team looks under the hood of Signal (heavily using Lucene) – social search for LinkedIn-Twitter accounts
  4. How to identify and stay clear of “Faux Marketing Metrics” serving no real purpose
  5. Why Macy’s loaded its Atlanta stores with hats
  6. Facebook profile photo angled at 15°? Fun-loving. 16°? Uh-oh. Risky business – Fast Company analyzes.
  7. Google Refine 2.0 – a great BI tool for “Data Wranglers”?
  8. Top 5 Free, Open Source Data Mining Software
  9. Machine Learning – A Love Story


  1. Architecture of the Month – Salesforce
  2. Very large-scale data analytics using Google’s Dremel (PDF from 36th international VLDB conference, 2010) or, how to use SQL to query (nested columnar; *not* relational records) tables containing 1 trillion+ records.
  3. 37Signals’ database elevator pitch – (1) Use Solid State Disks; (2) Delay sharding as long as you can.
  4. LinkedIn Data Infrastructure
  5. SaaS vendor Workday’s interesting database architecture – Entirely in-memory; “Workday’s database, if relationally designed, would require “1000s” of tables”
  6. How CERN (LHC) manages 20PB of data in JBOD (Just a Bunch Of Disks) hooked up to Linux boxes
  7. From MIT – the best-ever Data Management checklist. Every project should have this, or a similar one, validated at the start!
  8. Peek inside Microsoft’s large scale online services (PDF; Hotmail, Bing, Hadoop). Bing servers are CPU-bound because they use “data compression on memory and disk data .. causing extra processing”. Google uses a home-grown “fast” compression algorithm that “minimizes number of shift operations during decompression”.
  9. Netflix’s move to Amazon Cloud / SimpleDB, mainly for HA, is now fully documented. Interesting use cases were solved – e.g., the customer’s mobile phone was used as the primary key to avoid sequences. All constraints were dropped.


  1. Cary Millsap, Performance Guru, explains in ACM why “99% of the X workflow executions should finish under 100ms” is a better requirement statement than “the average response time should be 200ms”. 2nd part – “Thinking Clearly about Performance” in ACM
  2. Database Abnormalization 101
  3. Nati Shalom evaluates role of SQL in a supposed post-SQL world.
  4. Learn how Memcached works in 15 minutes – from a story.
  5. Thinking of adding indexes defensively? 31 indexes on a 340-column table could slow down inserts 8x.
  6. Facebook – how they deal with 13M queries per sec.
  7. Typical Database Development Mistakes made by Application Developers
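
Item 1's point about percentile requirements versus averages can be illustrated with a small sketch. The response times below are made-up numbers, and the `percentile` helper is a simple nearest-rank version written for this example, not something taken from the article:

```python
import statistics

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct% of the sample count, to avoid float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical response times in milliseconds (illustrative only).
fast_but_spiky = [50] * 90 + [1500] * 10   # 10% of requests are very slow
steady = [190] * 100                       # every request takes the same time

for name, samples in (("fast_but_spiky", fast_but_spiky), ("steady", steady)):
    mean_ms = statistics.mean(samples)
    p99_ms = percentile(samples, 99)
    print(f"{name}: mean={mean_ms} ms, p99={p99_ms} ms")
```

Both workloads satisfy “the average response time should be 200ms”, yet in the first one a tenth of the users wait 1.5 seconds. A percentile-based requirement exposes that difference; an average hides it.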


  1. NOSQL Patterns – finally!
  2. Great deployment of noSQL (analytics) and RDBMS (payment) together.
  3. 10 common Hadoop-able problems (from the guy who’d done it first)
  4. Michael Stonebraker clarifies his position – CAP theorem should not be justification to give up on ACID. Also “rm -rf” type errors cannot be recovered using CAP theorem terms.
  5. A great architecture diagram if you are in a hurry to convert relational to noSQL
  6. Too funny not to watch (NSFW at places) – MongoDB is Web Scale


  1. Twitter database took 12 hrs to restart making the whole ecosystem “unusable”.
  2. Teradata Product Strategy – intelligently split databases among flash (SSD) and disk
  3. Cost of JP Morgan Chase Oracle outage (suspected) caused by EMC hardware – $132M ACH transfers held up; 1000 auto & student loan applications were lost.
  4. eBay throws out its 6.5-petabyte Greenplum, moves the data into Teradata.
  5. HP practically giving up on BI and Analytics Market?


  1. The Numbers Behind NUMB3RS – Interested in how serial killers can be caught with data? Even if you do not do Bayesian analysis for a living, you will love how this book goes through the whole gamut of data techniques in Crime Solving.
  2. Beautiful Visualization – Why is color the Cinderella of Data Viz? In the line of the great “Beautiful Data” (2008), this book analyzes why the London Underground Map is so easily understood while the New York Subway map is not.
  3. Head First Data Analysis – Somewhat irreverent but a totally fun and quick read (just like Head First Design Patterns). An absolute must-read before buying your first home, or to learn why “heuristics are a middle ground between going with your gut and optimization”.
  4. Expert Oracle Database Architecture: 11g Programming Techniques; 2nd Ed. – Putting “Oracle” in this book’s title was a really unfortunate decision. Any data practitioner will add at least 5 years of work experience just by doing the case studies in this. Oh, it applies equally well to MySQL, Cassandra, CouchDB..<Put the latest fad here>
  5. Want to learn how to write efficient SQL from the master who created it all? An excellent 16-hr ‘SQL Master Class’ video course from Chris Date shows how to avoid common traps and pitfalls.
  6. Mining of Massive Data Sets – Stanford CS. For mining not so massive data sets, just guess 😉
  7. Elements of Statistical Learning *FREE* – Again from Stanford CS. A great text-bookish reference if you want to get started.
  8. Data Mining lecture notes from MIT. Great case study!
  9. Free (legal!) book on Probability and Statistics with R. (note – this is introduction to Statistics and Probability using R; NOT Introduction to R using Statistics and Probability)
  10. Check out the book – ‘SQL Antipatterns’. If you liked Martin Fowler, you’ll love it as well. It has been out for less than a couple of months!


  1. Avinash Kaushik, writer of two best-selling books on Web Analytics, blogs at “Occam’s Razor”
  2. No one knows optimizer and access path analysis better than Jonathan Lewis. His blog applies scientific methods rather than opinion to real-life use cases, especially w.r.t. scalability and performance.
  3. Facebook Engineering Notes has become one of the most popular technical blogs, often pondering on large data problems.
  4. MetaOptimize Q+A is by far the best blog / feed for Machine Learning, Modeling and other analytic problem areas.
  5. The Daily WTF not only makes you think “How can they do THAT?” but secretly makes you thankful that they don’t know about that mistake of yours!


1. Hadoop Summit 2010 presentations are now available – some great sessions were
2. Presentations from the R User Conference (July 20-23, 2010) are available now. These include a great introduction, “High Performance Computing with R”.
3. Surge 2010 was a “scalability, databases and web operations” conference for web 2.0 entities that are growing *really* fast
4. “noSQL took away relational model and gave nothing back”? Very interesting thread (real developers’ feedback, with some religious war-flames too) launched after the ‘noSQL evening’ in Palo Alto

  1. How Florence Nightingale collected data, presented statistical graphics (incl Bar Charts!) and brought in immense improvements in health standards.
  2. What rises twice a year – once at Easter, and then 2 weeks before Christmas; has a mini-peak every Monday and then flattens out over the summer? Peak break-up windows (according to analysis of vast Facebook updates) – David McCandless explains in a TED Talk
  3. 500 years of science progress as a Subway Map
  4. What is Marijuana really worth – apparently real-time mash-up of price data.
  5. Data for Date – Real-time Guy:Girl ratio in bars and other places.
  6. Hipmunk visuals for Airline reservation – available flights are aggregated over a single chart, and default-sorted by “agony” (price; duration; layover; red-eye).
  7. Facebook intern visualizes friendship pattern
  8. United States of Auto-complete is a fun (Google!) way to learn geography


  1. Think before you’re bored – analyzing news data shows April 11, 1954 is the most boring day ever since Jan 1, 1900.
  2. 300 Million Tweets (also) indicate Thursday (evening) is the nation’s Unhappiest Day
  3. Microsoft sells 7 copies of Windows every second; Facebook serves 1.2M photos a second.
  4. Burglars steal $200,000 mining Facebook data
  5. To become a top 1,000 website you need at least 4.1 million visitors per month. 3 visitors every 2 seconds!
  6. Google slows down your site so others can find it faster? Up to 50% of some Web Servers (CPU) resources are spent processing robot crawlers.
  7. Believers in Alien abduction are more likely than nonbelievers to drink Pepsi?
  8. 52% of US population lives in metropolitan areas (>0.5M people), but such cities only produce 13% of players in NHL; 29% in NBA; 15% in MLB; and 13% in PGA. (Wonderful causal analysis)
  9. iPhone users purchase fish 26 times more than Android users.
  10. A black background of Google would save about 750 megawatt-hours – enough to power 1000 houses for a year.
  11. A 1-in-4-trillion chance? Well, it just happened – the exact same six numbers were drawn twice in three weeks in an Israeli lottery.
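
The “3 visitors every 2 seconds” figure in item 5 is simple rate arithmetic. A quick check, assuming a 30-day month (the source does not say which month length it used):

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: the source does not state the month length

visitors_per_month = 4_100_000
seconds_per_month = SECONDS_PER_DAY * DAYS_PER_MONTH

visitors_per_second = visitors_per_month / seconds_per_month
visitors_per_two_seconds = 2 * visitors_per_second

print(f"{visitors_per_second:.2f} visitors per second")
print(f"{visitors_per_two_seconds:.2f} visitors every 2 seconds")
```

Under that assumption the rate comes out slightly above 3 visitors every 2 seconds, which matches the rounded figure quoted.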

    State of Data Last Week – Dec 11

    #Analysis – $3M prize to ‘develop a breakthrough algorithm’ using patient data ‘to prevent unnecessary hospitalizations’.

    #Architecture – How much storage do I need for my data? ROT (Rule of thumb) is 8-10x Raw data for OLTP; 2-4x Raw Data for Analytics
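
As a rough sketch, the rule of thumb above can be turned into a small sizing helper. The function name and the workload labels are hypothetical; only the multipliers come from the item itself:

```python
# (low, high) multipliers of raw data size, taken from the rule of thumb above.
STORAGE_MULTIPLIERS = {
    "oltp": (8, 10),
    "analytics": (2, 4),
}

def estimate_storage(raw_tb, workload):
    """Return the (low, high) storage estimate in TB for a given raw data size."""
    low, high = STORAGE_MULTIPLIERS[workload]
    return raw_tb * low, raw_tb * high

# Example: 5 TB of raw data.
for workload in ("oltp", "analytics"):
    low, high = estimate_storage(5, workload)
    print(f"{workload}: plan for {low}-{high} TB")
```

This is a planning range only; actual needs depend on indexes, replicas, backups and retention, none of which the rule of thumb breaks out.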

    #Big Data – 70 online databases that define our planet – this editor especially loves Reality Mining from the list

    #DBMS – Oracle 11g Interactive reference (from Oracle) – good download for quick access to stuff without Googling

    #Learning – Stanford CS book on ‘Mining of Massive Data Sets’ (PDF) is truly exceptional.

    #visualization – What the world searched for in 2010 – mostly Justin Bieber, chatroulette, iPad and FIFA world cup. ‘Swine Flu’ and ‘Slumdog’ are both down.

    (Preview) The genius Hans Rosling will do a one-hour special on ‘Joy of Stats’ on BBC.



    State of Data Last Week – Dec 06

    #Analysis – Logistic regression is a ‘categorical tool’, e.g., telling fraud from not-fraud. Here is a great starter with a worked-out case analysis in minutes using R on your laptop.
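
The linked starter uses R. The sketch below shows the same idea in plain Python instead, as a minimal illustration rather than a reproduction of that case study. The transaction amounts and fraud labels are invented, and the model is fitted with basic gradient descent instead of a statistics package:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Made-up data: transaction amount (in $1000s) and whether it was fraud (1) or not (0).
amounts = [0.2, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
is_fraud = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

w, b = fit_logistic(amounts, is_fraud)

def classify(amount, threshold=0.5):
    """Return 1 (fraud) if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(w * amount + b) >= threshold else 0

for amount in (0.5, 4.5):
    print(f"amount={amount}: p(fraud)={sigmoid(w * amount + b):.2f}, class={classify(amount)}")
```

The output is a probability that is then cut at a threshold, which is what makes logistic regression a tool for yes/no categories.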

    #Architecture – Top 5 Free, Open Source Data Mining Software

    #Big Data – One single racecar streams 27GB of telemetry data during a race weekend from 200 sensors.

    #DBMS – Build your own ‘Circular Log’ / ‘Log Rotation Routine’ with MySQL (or any data storage). With later Oracle releases, ‘interval partition’ is in-built for this.
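
One common way to build such a circular log is a fixed set of slots that newer entries overwrite. The sketch below is not the linked article's routine; it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table name, column names and capacity are all hypothetical:

```python
import sqlite3

CAPACITY = 5  # hypothetical ring size: only the newest 5 entries are kept

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ring_log ("
    " slot INTEGER PRIMARY KEY,"   # position in the ring, 0 .. CAPACITY-1
    " seq INTEGER NOT NULL,"       # ever-increasing sequence number
    " message TEXT NOT NULL)"
)

def append(conn, seq, message):
    """Write an entry into slot seq % CAPACITY, overwriting whatever was there."""
    conn.execute(
        "INSERT OR REPLACE INTO ring_log (slot, seq, message) VALUES (?, ?, ?)",
        (seq % CAPACITY, seq, message),
    )

for seq in range(12):
    append(conn, seq, f"event {seq}")

rows = conn.execute("SELECT seq, message FROM ring_log ORDER BY seq").fetchall()
for seq, message in rows:
    print(seq, message)
```

The table never grows beyond `CAPACITY` rows, so no separate purge job is needed. In MySQL the equivalent overwrite would typically be written with `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE`.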

    #Learning – Want to learn how to write efficient SQL from the master who created it all? An excellent 16-hr ‘SQL Master Class’ video course from Chris Date shows how to avoid common traps and pitfalls. Best of all, it’s completely FREE for Intuit employees using Safari Online.

    #visualization – Logstalgia displays your web access logs as a ‘pong like battle between Web Server and a never ending torrent of requests’. Requests appear as color balls!

    glTail is a similar FREE, real-time log-visualization tool – ‘each circle is a hit on website, and size of circle indicates the size of request’.