State of Data Last Week – Aug 29

Cool Numbers – Want your kid to play in the Major Leagues some day? Move to a smaller city!

Cary Millsap, Performance Guru, explains in ACM why “99% of the X workflow executions should finish under 100ms” is a better requirement statement than “the average response time should be 200ms”.
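To see why the percentile wording is stricter, here is a minimal sketch with made-up response times (not data from Millsap's article). The average stays under 200ms even though one request in twenty takes three seconds. `percentile` is a small nearest-rank helper written just for this illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical workload: 95 fast requests (50 ms) and 5 very slow ones (3000 ms).
response_ms = [50] * 95 + [3000] * 5

mean_ms = sum(response_ms) / len(response_ms)
p99_ms = percentile(response_ms, 99)

print(f"mean = {mean_ms:.1f} ms")  # passes an "average under 200 ms" requirement
print(f"p99  = {p99_ms} ms")       # fails a "99% under 100 ms" requirement badly
```

The average-based requirement is met while 5% of users wait three seconds; the percentile-based one catches it.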

“CouchDB – The Definitive Guide” book is now available FREE! Chapters on “Eventual Consistency” (with a case study) are recommended (the Hadron Collider uses CouchDB).

3 rules of thumb for using Bloom Filter – “..add 1024 elements to 1KB bloom filter, ..with ~2% false positive rates”.
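The quoted rule of thumb can be checked against the textbook Bloom filter approximation, (1 - e^(-kn/m))^k, where m is the number of bits, n the number of items and k the number of hash functions. This is a rough sketch, assuming the filter uses a near-optimal number of hash functions; the function names are made up for the example.

```python
import math

def false_positive_rate(bits, items, hashes):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-hashes * items / bits)) ** hashes

def optimal_hashes(bits, items):
    """k = (m/n) * ln 2, rounded to a whole number of hash functions (at least 1)."""
    return max(1, round(bits / items * math.log(2)))

bits = 1024 * 8   # a 1 KB filter
items = 1024      # elements added

k = optimal_hashes(bits, items)
rate = false_positive_rate(bits, items, k)
print(f"k = {k}, false positive rate = {rate:.2%}")  # roughly 2%
```

At 8 bits per element the estimate lands at about 2%, which matches the quote.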

O’Reilly Strata (‘The Business of Data’) Conference Call is now open (Feb 1-3, 2011)

Data Visualization – Stock heat map of the performance of common stocks traded on Nasdaq (Thanks to Karin!)

What’s the solution to “Data Glut” or information overload? Use your eyes more! David McCandless explains in a TED Talk why there’s a media surge in reports on violent video games every April.

Best Science and Health Infographics from New York Times (Social web of Smokers and Measure of a President are cool)

Cocktail party cheat-sheet – “Data is the new soil” (not the new oil) – the new fertile medium.
Why is Trader Joe’s so different? It stocks fewer than 4,000 SKUs compared to the 50,000 SKUs stocked by typical grocery stores. So, it sells $1,750 worth of items per sq ft (2x that of Whole Foods).
An earthquake of the same magnitude kills up to 200x fewer people in a democratic country than in a dictatorship.

State of Data Last Week – Aug 22

Cool Numbers

What exactly is “Data Science” and why the future belongs to folks turning data into product.

SaaS vendor Workday’s interesting database architecture – Entirely in-memory; “Workday’s database, if relationally designed, would require “1000s” of tables”

Excellent guide to performance (index strategies) for data developers – time to replace the overused metaphor – “index is more complex than a phone book because it undergoes constant change .. phone book doesn’t provide enough space to write a new entry between two existing ones”

Hadoop ecosystem World Map – e.g., hiho and Sqoop for loading RDBMS data into Hadoop.

bashreduce – MapReduce in a bash script. Split your unix command across many machines. Brilliant!

Teradata Product Strategy – intelligently split databases among flash (SSD) and disk

Data Visualization – We all know what visualized sort algorithms look like, but “What different sort algorithms sound like” takes the cake. Merge-sort sounds so ’80s ;-)

Map of worldwide undersea cables
Cocktail party cheat-sheet – Membase is the 500,000 ops-per-second database powering Farmville.

State of Data Last Week – Aug 15

Cool Numbers – 1 and only 1!

Data is art. Martin Wattenberg’s (Flowing Media) presentation in MIT World (check out the similarity of “How to be” and “How not to be” search; FleshMap)

10 common Hadoop-able problems (from the guy who’d done it first)

Don’t fight, Integrate – Great deployment of noSQL (analytics) and RDBMS (payment) together.

How to solve Netflix recommendation challenge with (a) huge, yet, (b) limited data
Data Portability –
How opening up of data led to progress on Alzheimer’s

How did Weather Data get opened

A new Taxonomy of Social Networking Data (if you already know what’s “Incidental Data”, don’t bother ;-))

Data Visualization – How Florence Nightingale collected data, presented statistical graphics (incl Bar Charts!) and brought in immense improvements in health standards.

Cocktail party cheat-sheet – What does your Credit Card provider know about you? If you use your credit card at the dentist’s, you’re 4x less likely to miss payments in the next year than someone who uses their card at a drinking place.

State of Data Last Week – Aug 08

Cool Numbers – It’s depressing!

Analysis or Dare?
(a) People who swat flies have a thing for USA Today?
(b) Believers in alien abduction are more likely than nonbelievers to drink Pepsi?
(c) People who cut their sandwiches diagonally are more likely to wear Ray-Bans?
(d) Southerners bought more white, green and pink than other regions’ residents?
(e) E.R. care accounts for less than 3% of healthcare spending?

Speaking of analysis, offloading predictive analysis to the Google API – Machine Learning as a Service, all you need to know.

Ben Horowitz on “Taking the mystery out of Scaling a company” – a lot applies to scaling technology as well. E.g., replace “people” and “employees” with “servers” in the snippet –

“when adding servers into the company feels like more work than the work that you can offload to the new servers”

Bit of nostalgia for those who’d worked with “large, 12MB databases” (or before) – Oracle 5 Installation Live

NOSQL Patterns – finally!

CouchDB has started losing data or, at least, started admitting it.
“once the bad code path is triggered, subsequent writes to the database are never committed. This means there is potential data-loss for users of 1.0.0.”

Is there a LAMP framework equivalent for Big Data Processing?

Architecture of the Month – LinkedIn Data Infrastructure

Data Visualization – So you want to watch YouTube flowchart (click on the picture for large version)
Automotive Family Tree – really cool!

Cocktail party cheat-sheet – Even the cheapest SSD (Solid State Disk) is 5-7x faster on TPC benchmarks than regular drives (for PostgreSQL)

State of Data Last Week – Aug 01

Cool Numbers – Change flows through the free Wi-Fi, finally!

noRestart? Last Monday, Twitter’s database took 12 hrs to restart, making the whole ecosystem “unusable”.

Papers from the R User Conference, July 20-23, 2010, are available now. They include a great introduction, “High Performance Computing with R”.

Free (legal!) book on Probability and Statistics with R
(note – this is an introduction to Statistics and Probability using R; NOT an introduction to R using Statistics and Probability)

37Signals’ database elevator pitch
– (1) Use Solid State Disks; (2) Delay sharding as long as you can.
A new version of the database, SQL Anywhere 12, is now publicly available. The free developer edition can be downloaded here.

Business of Data
– Google and CIA co-funding the same data mining startup – Recorded Future.

Next Data Startup idea? Build a common data format for sharing proteomics data. It’s a huge mess today because everyone speaks a different “language”.
Data Visualization – 30-minute history. The first pie-chart was published in 1801.

Quote of the week – “3NF is typically a selfless model used by Enterprise data warehouse, which is used by the whole company. A star schema is a selfish model, used by a department, because it’s already got aggregation in it.” (Forrester)

Cocktail party cheat-sheet – MySQL cannot do hash joins.

State of Data – July 26

Cool Numbers – Change flows through the free Wi-Fi, finally!

How to build Data-Driven startups – “In God we trust; all others must bring data”. (STRONG language warning!)

Data Tweet of the week – “People who choose their datastore based on hearsay and not their own evaluation are doomed” #nosql #cassandra

Pretty close second – “No global lock ever goes unpunished”

The most passionate guy who ever talked data on this planet, ever, is BACK! New TED talk from Hans Rosling – you cannot miss Inception either

YeSQL – Nati Shalom evaluates role of SQL in a supposed post-SQL world.

noEducation on Data – Database Abnormalization 101

Data Visualization – More than a dozen Washington Post journalists spent over 2 years building the database – Top Secret America.

e.g. – 1,931 private companies help the Government in “Top-Secret” activities; AT&T, for instance, helps in “nuclear operations”.

State of Data Last Week – July 19

Cool Numbers – “Are we there yet?”

  • To become a top 1,000 website you need at least 4.1 million visitors per month. 3 visitors every 2 seconds!
  • To become a top 500 website you need at least 7.4 million visitors per month.
  • To become a top 100 website you need at least 22 million visitors per month.
  • To become a top 50 website you need at least 41 million visitors per month.
  • To become a top 10 website you need at least 230 million visitors per month.
  • To become the number 1 website in the world? Then you need more than 540 million visitors per month. i.e., 210 visitors every second!
  • Your chance of having a mid-air collision increased 35% just last year! “‘Serious’ airspace incursions in the United States rose from 2.44 per million flights to 3.28 per million flights.”
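The per-second figures and the 35% increase in the list above are easy to sanity-check. A minimal sketch, assuming a 30-day month (the helper names are made up for the example):

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month

def visitors_per_second(monthly_visitors):
    return monthly_visitors / SECONDS_PER_MONTH

def percent_increase(before, after):
    return (after - before) / before * 100

top_1000 = visitors_per_second(4_100_000)
number_1 = visitors_per_second(540_000_000)
incursions = percent_increase(2.44, 3.28)

print(f"top 1,000: {top_1000:.2f} visitors/sec")         # about 3 every 2 seconds
print(f"number 1:  {number_1:.0f} visitors/sec")         # a little over 200
print(f"incursion rate increase: {incursions:.0f}%")     # close to the quoted 35%
```

So "3 visitors every 2 seconds", "210 visitors every second" and "35%" are all slightly rounded up, but in the right ballpark.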

CouchDB becomes the first production-ready noSQL database.

Very large-scale data analytics using Google’s Dremel (PDF from 36th international VLDB conference, 2010) or, how to use SQL to query (nested columnar; *not* relational records) tables containing 1 trillion+ records.

Interesting takeaways –

  • MapReduce took 5000 sec to sort 85B records; Dremel took about 15 sec!
  • MapReduce benefits from columnar storage just like a typical RDBMS does
  • Dremel can achieve up to 100B records/second throughput with a shared cluster
  • Trade accuracy for speed (finish faster with sampling!)
  • The bulk of the data scan is fast; getting through the last few percent reasonably fast is challenging (Pareto Principle)
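The "finish faster with sampling" takeaway can be illustrated with a toy example. This is not Dremel's mechanism, just the general idea of answering an aggregate query from a fraction of the rows and accepting a small error; the "table" and function names are hypothetical.

```python
import random

def exact_mean(values):
    """Full scan: touches every row."""
    return sum(values) / len(values)

def sampled_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample instead of a full scan."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(values, sample_size)
    return sum(sample) / sample_size

# Hypothetical "table": one million integer values.
table = range(1_000_000)

full = exact_mean(table)
approx = sampled_mean(table, sample_size=10_000)  # touches 1% of the rows
error = abs(approx - full) / full

print(f"exact = {full:.1f}, sampled = {approx:.1f}, relative error = {error:.2%}")
```

Reading 1% of the rows typically gets within a percent or so of the exact answer here, which is often good enough for exploratory queries.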

Data Mining lecture notes from MIT. Great case study!

Data thought leader – Future of Databases – rumination on SQL, Ruby, C# and evolution.

Data Visualization –

Ten Maps that changed the world.

Visual History of Data Visualization for past 500 years.

50 ways of visualizing BP Oil Spill

State of Data Last Week – July 12

Cool Numbers – On World Cup & July 4 –

  • If the keeper wears a red jersey, only 54 percent of penalties will succeed. It goes up to 69 percent if it’s yellow. If the keeper wears a blue jersey, 72 percent will go in. And the worst thing that the keeper could do is wear a green jersey because then 75 percent of penalty kicks will go in.  Why? Hint: Bull Fighting
  • Best speed for a penalty ‘has to be between 56 and 65 miles an hour. If it’s faster than that, then the kicker is going to lose accuracy. If it’s slower than 56 miles an hour, then the goalie has a chance to catch it.’
  • Americans spent over $600M on fireworks over the holiday weekend; two-thirds of it went to backyard fireworks
  • We also spent $2B on cookouts on the weekend, with about $95M on lighter fluid.

Funniest quote on NoSQL movement and a lot of interesting pertinent discussion – (at Boston Big Data Summit) “The NoSQL movement is a lot like the Ron Paul campaign – it consists of people who are dissatisfied with the status quo, whose dissatisfaction has a lot to do with insufficient liberty and/or excessive expenditure, and who otherwise don’t have a whole lot in common with each other.” – Curt Monash

great, technical community for Data Geeks

Excellent primer on MongoDB Performance & Durability with interesting discussions (comments)

How to find out if Google Apps site loads about 6x than your’s or not – Finally a great, intuitive tool to use with both serialized and concurrent loads. You’re your own “Godzilla vs King Kong” at

Architecture of the Month – Salesforce

a) 72K customers, 150K apps, 8 Oracle RAC nodes on Redhat Linux, Lucene Text search servers and SAN disk array.
b) 16 instances; 680K tables / objects; 8 DBAs
c) A flex schema with up to 500 varchar columns; 170M daily transactions

Event of the week – “The State of Database with Tim Ellis” on July 13, in Hacker Dojo, Mountain View.

State of Data Last Week – July 6

Cool Numbers – Microsoft by Numbers

  • Microsoft sells 7 copies of Windows every second;
  • Largest 25 US dailies have 16M subscribers
  • Netflix alone has 14M members. Xbox has 23M!!

Hadoop Summit 2010 presentations are now available. My favorite sessions were –

Good old (R)DBMS still alive – Why did Quora choose MySQL as its data store rather than one of the NoSQL options? The elevator pitch is – if your app can run fine with partitioned data (i.e., a request never has to go to more than 1 shard / partition), it will be fine with just about any data store. I also liked the use of Donald Knuth’s “Premature optimization is the root of all evil” there.
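What "not having to go to more than 1 shard" looks like in practice: every request carries a partition key, and a stable hash of that key picks the single shard to talk to. The sketch below is a made-up illustration, not Quora's implementation; the shard count, key format and in-memory dicts standing in for databases are all assumptions.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(key, num_shards=NUM_SHARDS):
    """Map a partition key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical in-memory "shards": one dict per shard stands in for a database.
shards = [dict() for _ in range(NUM_SHARDS)]

def put(user_id, profile):
    shards[shard_for(user_id)][user_id] = profile

def get(user_id):
    # A single-key lookup touches exactly one shard.
    return shards[shard_for(user_id)].get(user_id)

put("user:42", {"name": "Ada"})
print(get("user:42"))
```

As long as every query can be answered this way, the store behind each shard matters much less; the trouble starts when a query needs to join across shards.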

OK, enough of dinosaurs! What about Facebook, now the world’s largest Hadoop deployment – 21PB of data in a single HDFS cluster, 12TB/server/32GB RAM. Yahoo loses again with a “meager” 12PB!
And Twitter’s? Avi Bryant of Twitter Analytics Team speaks this week about how DabbleDB could improve Twitter ad efficiency.

Here’s a cool benchmark. Even in your “death grip”, the iPhone 4 is still faster than the 3GS (using the FCC Mobile Broadband Test, iPhone 4 download speed is almost 2x that of the 3GS).

Here’s one piece of trivia to impress folks – SQL IS “Turing Complete” (a fancy way to indicate you can do IF-ELSE and GO TO). If you want to brush up on it – SQLZoo is a fantastic FREE resource. Spread it to all the new interns and developers – they will thank you after a year! Also, it is database agnostic.

On the Mobile Data API front – it’s either an early Christmas or a very bad hurricane. Oracle joined the SQLite Consortium in late June.
Lastly, how expensive was Amazon’s 3-hr “outage” last week? About $5.25M (their hourly revenue is $1.75M)

