State of Data Last Week – Aug 29

Cool Numbers – Want your kid to play in the Major Leagues some day? Move to a smaller city!

Cary Millsap, Performance Guru, explains in ACM why “99% of the X workflow executions should finish under 100ms” is a better requirement statement than “the average response time should be 200ms”.
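A quick sketch of why the percentile wording is the stronger requirement. The latency numbers below are made up for illustration (they are not from the article), and `percentile` is a hypothetical helper using the nearest-rank method. The sample meets "average 200ms" exactly, yet 5% of requests take about 3 seconds.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical response times in milliseconds: 95 fast requests and 5 very slow ones.
latencies_ms = [50] * 95 + [3050] * 5

mean_ms = statistics.mean(latencies_ms)          # the "average response time"
p99_ms = percentile(latencies_ms, 99)            # what the slowest 1% boundary looks like
share_under_100 = sum(t < 100 for t in latencies_ms) / len(latencies_ms)

print(f"mean = {mean_ms} ms, p99 = {p99_ms} ms, under 100 ms = {share_under_100:.0%}")
```

The average hides the slow tail completely, while the "99% under 100ms" statement fails as soon as more than 1 request in 100 is slow.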

“CouchDB – The Definitive Guide” book is now available FREE! The chapter on “Eventual Consistency” (with a case study) is recommended (the Hadron Collider uses CouchDB).

3 rules of thumb for using Bloom Filter – “..add 1024 elements to 1KB bloom filter, ..with ~2% false positive rates”.
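The ~2% figure can be sanity-checked with the standard Bloom filter approximation p = (1 - e^(-kn/m))^k, where m is the number of bits, n the number of elements and k the number of hash functions. This is only a back-of-the-envelope sketch: the function name is mine, and the choice of k = 5 or 6 is an assumption (the nearest integers to the optimal k = (m/n)·ln 2), not something stated in the linked post.

```python
import math

def bloom_false_positive_rate(bits, items, hashes):
    """Standard approximation of the false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-hashes * items / bits)) ** hashes

bits = 1024 * 8    # a 1 KB filter has 8192 bits
items = 1024       # elements added, so 8 bits per element

optimal_hashes = (bits / items) * math.log(2)   # about 5.5, so try k = 5 and k = 6
rates = {k: bloom_false_positive_rate(bits, items, k) for k in (5, 6)}

for k, rate in rates.items():
    print(f"k = {k}: false-positive rate is about {rate:.2%}")
```

Both choices of k land a little above 2%, which agrees with the rule of thumb.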

O’Reilly Strata (‘The Business of Data’) Conference Call is now open (Feb 1-3, 2011)

Data Visualization – Stock heat map of the performance of common stocks traded on Nasdaq (Thanks to Karin!)

What’s the solution to “Data Glut” or information overload? Use your eyes more! David McCandless explains in a TED talk why there’s a surge in media reports on violent video games every April.

Best Science and Health Infographics from New York Times (Social web of Smokers and Measure of a President are cool)

Cocktail party cheat-sheet – “Data is the new soil” (not the new oil) – the new fertile medium.
Why is Trader Joe’s so different? It stocks fewer than 4,000 SKUs, compared to the 50,000 SKUs stocked by typical grocery stores. So, it sells $1,750 worth of items per sq ft (2x that of Whole Foods).
Earthquakes of the same magnitude kill up to 200x fewer people in a democratic country than in a dictatorship.


State of Data Last Week – Aug 22

Cool Numbers

What exactly is “Data Science” and why the future belongs to folks turning data into product.

SaaS vendor Workday’s interesting database architecture – Entirely in-memory; “Workday’s database, if relationally designed, would require “1000s” of tables”

Excellent guide to performance (index strategies) for data developers – time to replace the overused metaphor – “index is more complex than a phone book because it undergoes constant change .. phone book doesn’t provide enough space to write a new entry between two existing ones”

Hadoop ecosystem World Map – e.g., hiho and Sqoop for loading RDBMS data into Hadoop.

bashreduce – MapReduce in a bash script. Split your unix command across many machines. Brilliant!

Teradata Product Strategy – intelligently split databases among flash (SSD) and disk

Data Visualization – We all know what visualized sort algorithms look like, but “What different sorting algorithms sound like” takes the cake. Merge sort sounds so ’80s 😉

Map of worldwide undersea cables
Cocktail party cheat-sheet – Membase is the 500,000 ops-per-second database powering Farmville.

State of Data Last Week – Aug 15

Cool Numbers – 1 and only 1!

Data is art. Martin Wattenberg’s (Flowing Media) presentation in MIT World (check out the similarity of “How to be” and “How not to be” search; FleshMap)

10 common Hadoop-able problems (from the guy who’d done it first)

Don’t fight, Integrate – Great deployment of noSQL (analytics) and RDBMS (payment) together.

How to solve Netflix recommendation challenge with (a) huge, yet, (b) limited data
Data Portability –
How opening up of data led to progress on Alzheimer’s

How did Weather Data get opened

A new Taxonomy of Social Networking Data (if you already know what’s “Incidental Data”, don’t bother ;-))

Data Visualization – How Florence Nightingale collected data, presented statistical graphics (incl Bar Charts!) and brought in immense improvements in health standards.

Cocktail party cheat-sheet – What does your Credit Card provider know about you? If you use your credit card at a dentist’s, you’re 4x less likely to miss payments in the next year than someone who uses his card at a drinking place.

State of Data Last Week – Aug 08

Cool Numbers – It’s depressing!

Analysis or Dare?
(a) People who swat flies have a thing for USA Today?
(b) Believers in Alien abduction are more likely than nonbelievers to drink Pepsi?

(c) People who cut their sandwiches diagonally are more likely to wear RayBan?

(d) Southerners bought more white, green and pink than other regions’ residents?

(e) E.R. care accounts for less than 3% of healthcare spending?
Talking of analysis, offloading Predictive Analysis to Google API – Machine Learning as a Service, all you need to know.

Ben Horowitz on “Taking the mystery out of Scaling a company” – a lot applies to scaling technology as well. e.g., replace “people” with “server” in the snippet –

“when adding [people → servers] into the company feels like more work than the work that you can offload to the new [employees → servers]”

Bit of nostalgia for those who’d worked with “large, 12MB databases” (or before) – Oracle 5 Installation Live

NOSQL Patterns – finally!

CouchDB has started losing data or, at least, started admitting it.
“once the bad code path is triggered, subsequent writes to the database are never committed. This means there is potential data-loss for users of 1.0.0.”

Is there a LAMP framework equivalent for Big Data Processing?

Architecture of the Month – LinkedIn Data Infrastructure

Data Visualization – So you want to watch YouTube flowchart (click on the picture for large version)
Automotive Family Tree – really cool!

Cocktail party cheat-sheet – Even the cheapest SSD (Solid State Disk) is 5-7x faster at TPC than usual drives (for PostgreSQL)

State of Data Last Week – Aug 01

Cool Numbers – Change flows through the free WI-FI, finally!

noRestart? Last Monday, Twitter’s database took 12 hrs to restart, making the whole ecosystem “unusable”.

Papers from R User Conference July 20-23, 2010 are available now. This has a great introduction “High Performance Computing with R”.

Free (legal!) book on Probability and Statistics with R
(note – this is an introduction to Statistics and Probability using R; NOT an introduction to R using Statistics and Probability)

37Signals’ database elevator pitch
– (1) Use Solid State Disks; (2) Delay sharding as long as you can.
New version of the database SQL Anywhere 12 is now publicly available. Developer free edition can be downloaded here.

Business of Data
– Google and CIA co-funding the same data mining startup – Recorded Future.

Next Data Startup idea? Build a common data format for sharing proteomics data. It’s in huge mess today because everyone speaks different “language”.
Data Visualization – 30-minute history. The first pie-chart was published in 1801.

Quote of the week – “3NF is typically a selfless model used by Enterprise data warehouse, which is used by the whole company. A star schema is a selfish model, used by a department, because it’s already got aggregation in it.” (Forrester)

Cocktail party cheat-sheet – MySQL cannot do hash joins.

State of Data – July 26

Cool Numbers – Change flows through the free WI-FI, finally!

How to build Data-Driven startups – “In God we trust; all others must bring data”. (STRONG language warning!)

Data Tweet of the week – “People who choose their datastore based on hearsay and not their own evaluation are doomed” #nosql #cassandra

Pretty close second – “No global lock ever goes unpunished”

The most passionate guy who ever talked data on this planet, ever, is BACK! New TED talk from Hans Rosling – you cannot miss Inception either

YeSQL – Nati Shalom evaluates role of SQL in a supposed post-SQL world.

noEducation on Data – Database Abnormalization 101

Data Visualization – More than a dozen Washington Post journalists spent over 2 years building the database — Top Secret America.

e.g., 1,931 private companies help the Government in “Top-Secret” activities; AT&T, for instance, helps in “nuclear operations”.

State of Data Last Week – July 19

Cool Numbers – “Are we there yet?”

  • To become a top 1,000 website you need at least 4.1 million visitors per month. 3 visitors every 2 seconds!
  • To become a top 500 website you need at least 7.4 million visitors per month.
  • To become a top 100 website you need at least 22 million visitors per month.
  • To become a top 50 website you need at least 41 million visitors per month.
  • To become a top 10 website you need at least 230 million visitors per month.
  • To become the number 1 website in the world? Then you need more than 540 million visitors per month. i.e., 210 visitors every second!
  • Your chance of having a mid-air collision has increased 35% just last year! “‘Serious’ airspace incursions in the United States rose from 2.44 per million flights to 3.28 per million flights.”
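The per-second figures and the percentage in the list above are simple arithmetic, and worth checking. A rough sketch, assuming a 30-day month (my assumption; the source does not say which month length it used):

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumes a 30-day month

def visitors_per_second(monthly_visitors):
    """Convert a monthly visitor count into an average per-second rate."""
    return monthly_visitors / SECONDS_PER_MONTH

top_1000 = visitors_per_second(4.1e6)     # roughly 1.6 per second, i.e. about 3 every 2 seconds
number_1 = visitors_per_second(540e6)     # a bit over 200 per second

# Relative increase in "serious" airspace incursions per million flights.
incursion_increase = (3.28 - 2.44) / 2.44  # roughly a third

print(f"top 1,000: {top_1000:.2f}/s, number 1: {number_1:.0f}/s, "
      f"incursions up {incursion_increase:.0%}")
```

The results line up with the quoted "3 visitors every 2 seconds" and "210 visitors every second", and the incursion rate works out to an increase of about a third, close to the 35% quoted.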

CouchDB becomes the first production-ready noSQL database.

Very large-scale data analytics using Google’s Dremel (PDF from 36th international VLDB conference, 2010) or, how to use SQL to query (nested columnar; *not* relational records) tables containing 1 trillion+ records.

Interesting takeaways –

  • MapReduce took 5000 sec to sort 85 billion records; Dremel took about 15 sec!
  • MapReduce benefits from columnar storage just like a typical RDBMS does
  • Dremel can achieve up to 100B records/second throughput with a shared cluster
  • Trade accuracy for speed (finish faster with sampling!)
  • The bulk of the data scan is fast; getting through the last few percent reasonably fast is challenging (Pareto Principle)
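The "finish faster with sampling" takeaway can be illustrated with a toy example. To be clear, this is plain random sampling over an in-memory list, not how Dremel itself works; the data, the 1% sample size and the predicate are all invented for illustration.

```python
import random

# Hypothetical "table": 200,000 records with a single integer column in 0..9.
records = [i % 10 for i in range(200_000)]

def matches(value):
    """Toy predicate, standing in for a WHERE clause."""
    return value < 3

# Exact answer: scan every record.
exact_fraction = sum(matches(v) for v in records) / len(records)

# Approximate answer: scan a 1% random sample (fixed seed so the run is repeatable).
rng = random.Random(42)
sample = rng.sample(records, len(records) // 100)
estimated_fraction = sum(matches(v) for v in sample) / len(sample)

print(f"exact = {exact_fraction:.3f}, estimate from 1% sample = {estimated_fraction:.3f}")
```

The sample touches 100x fewer records and still lands within a few percentage points of the exact answer, which is the trade the takeaway describes.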

Data Mining lecture notes from MIT. Great case study!

Data thought leader – Future of Databases – rumination on SQL, Ruby, C# and evolution.

Data Visualization –

Ten Maps that changed the world.

Visual History of Data Visualization for past 500 years.

50 ways of visualizing BP Oil Spill