State of Data #57

#analysis – ‘Business of Big Data’ – how venture fund analysts look at the taxonomy of Big Data.


#architecture – Set the politics aside for a while: this “garbled” tweet analysis is possibly the best UTF-8 / encoding tutorial ever.
Why didn’t I just say “The software read in a UTF-8 encoded JSON stream of tweets and displayed it with an ANSI Windows Code Page 1252.” Because that wouldn’t be nearly as fun.
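The mismatch described in that quote can be reproduced in a few lines. This is only a minimal sketch, assuming Python 3; the sample string and the function names `misread_as_cp1252` and `repair` are made up for illustration and are not from the linked analysis.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo the mistake: recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


original = "café’s"                    # hypothetical sample text
garbled = misread_as_cp1252(original)  # each non-ASCII character becomes 2-3 junk characters
print(garbled)
print(repair(garbled) == original)
```

The repair only works when every byte maps to a defined Windows-1252 character; a few byte values (0x9D, for example) are undefined in Python's `cp1252` codec, so some strings raise an error rather than round-trip.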


#big_data – Love this thought experiment – reproducing YouTube with an Oracle-driven architecture would cost ~$0.5B in hardware and software licenses.

#conference – First MongoDB Meetup in Bay Area, July 19

2011 Joint Statistical Meetings, Miami, July 31-Aug 3 – heavy emphasis on using R with Predictive Analysis 

#DBMS – Expert Oracle GoldenGate (book) is now available in Safari. GoldenGate can be used either for disaster recovery (DR) or for heterogeneous data integration, with or without transformation.

#learning –
What is the typical Hadoop compression factor? “6-10X compression is common for ‘curated’ Hadoop data”.

For low-value machine-generated data, a “lot of it would be repetitive ‘I’m fine; nothing to report’ kinds of events”.

The compression-factor post also reverse-engineered Yahoo’s recent “standard Hadoop server” config –

  • 8-12 cores
  • 48 gigabytes of RAM
  • 12 disks of 2 or 3 TB each
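To get a feel for what these numbers imply together, here is a rough back-of-the-envelope sketch. It only combines the disk counts and the 6-10X compression range quoted above; the replication factor of 3 is my assumption (it is the HDFS default, but the source does not say what Yahoo used), and space for the OS or temporary data is ignored. The function names are hypothetical.

```python
def raw_capacity_tb(disks: int, tb_per_disk: float) -> float:
    """Raw disk capacity of one server, in terabytes."""
    return disks * tb_per_disk


def logical_capacity_tb(raw_tb: float, compression: float, replication: int = 1) -> float:
    """Uncompressed data that fits in raw_tb, given a compression factor
    and the number of copies stored (replication is an assumption here)."""
    return raw_tb * compression / replication


low_raw = raw_capacity_tb(12, 2)    # 12 disks of 2 TB
high_raw = raw_capacity_tb(12, 3)   # 12 disks of 3 TB

# Pessimistic case: smaller disks, 6X compression, 3 copies (assumed).
print(logical_capacity_tb(low_raw, 6, replication=3))
# Optimistic case: larger disks, 10X compression, 3 copies (assumed).
print(logical_capacity_tb(high_raw, 10, replication=3))
```

Under those assumptions a single server holds 24-36 TB raw, which works out to roughly 48-120 TB of uncompressed “curated” data per node's share of the cluster.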


#visualization – Superb analysis of Slopegraphs, one of Edward Tufte’s less popular ideas. [ed. It probably did not catch on because it is like reading a book while skiing – lots of jarring vertical eye movements to read text, for readers of horizontal scripts]




  • Skiing Data Eye Candy – Serious skiers can see data on speed, vertical rate of descent, etc. in the corner of their goggles. The data can later be uploaded to a computer.
  • No workload increase in a decade – “If you add up all the hours worked in the economy in June 2011 they are equal to all the hours worked in February of 1999”. Interesting data on supply-side economy.
  • Watch out for correlation attack – What is the relation between Wimbledon and the washing machine repair business?
  • Exa-iting – A single telescope aims to generate more data a day in 2020 than the entire internet generates today – an Exabyte a day. But the universe has only about 4% actual ‘matter’ – so the data should compress well 😉



About Nilendu Misra
I love to learn, create and coach. Things that I do well are - Communicating ideas - verbally or through words and diagrams; Problem Solving - Logical or Abstract; Very Large Scale Systems; think about 'Frighteningly Simple' approach first. Things that I intend to do better are - Establishing Stringent Process; Exchanging Tough Feedback; Keeping up with my reading or To-Do list to be able to completely relax.
