State of Data #57
July 14, 2011
#analysis – ‘Business of Big Data’ – how venture fund analysts look at the taxonomy of Big Data.
#architecture – Set the politics aside for a while; this “garbled” tweet analysis is possibly the best UTF-8 / encoding tutorial ever.
“Why didn’t I just say “The software read in a UTF-8 encoded JSON stream of tweets and displayed it with an ANSI Windows Code Page 1252.” Because that wouldn’t be nearly as fun.”
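The mix-up the quote describes is easy to reproduce. A minimal sketch (the sample text is my own, not from the tweet analysis): encode a string as UTF-8, then display the bytes as if they were Windows Code Page 1252.

```python
# Mojibake in two lines: UTF-8 bytes misread as Windows-1252.
tweet = "café"                          # made-up sample text
utf8_bytes = tweet.encode("utf-8")      # b'caf\xc3\xa9' – the bytes on the wire
garbled = utf8_bytes.decode("cp1252")   # wrong code page: 'cafÃ©'
print(garbled)                          # prints "cafÃ©"

# The damage is reversible while the bytes survive:
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered)                        # prints "café"
```

Each multi-byte UTF-8 sequence explodes into two or more Latin-1-ish characters, which is exactly the “garbled” look the article dissects.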
#big_data – Love this thought experiment – reproducing YouTube on an Oracle-driven architecture would cost ~$0.5B in hardware and software licenses.
#conference – First MongoDB Meetup in the Bay Area, July 19.
2011 Joint Statistical Meetings, Miami, July 31-Aug 3 – heavy emphasis on using R with Predictive Analysis
#DBMS – Expert Oracle GoldenGate (book) is now available on Safari. GoldenGate can be used either for DR or for heterogeneous data integration, with or without transformation.
#learning – What is the typical range of Hadoop compression factors? “6-10X compression is common for “curated” Hadoop data”.
For low-value machine-generated data, a “lot of it would be repetitive “I’m fine; nothing to report” kinds of events.”
The same piece also reverse-engineers Yahoo’s recent “standard Hadoop server” config –
- 8-12 cores
- 48 gigabytes of RAM
- 12 disks of 2 or 3 TB each
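A quick back-of-the-envelope on what that box holds. The 3x HDFS replication factor is my assumption (the common default), not something the article states; the 6-10X range is the one quoted above.

```python
# Per-node capacity estimate for the "standard Hadoop server" config above.
# Assumption: 3x HDFS replication (the usual default).
disks, disk_tb = 12, 2                 # low end of the config: 12 disks of 2 TB
raw_tb = disks * disk_tb               # 24 TB raw per node
usable_tb = raw_tb / 3                 # 8 TB after 3x replication
for factor in (6, 10):                 # the quoted compression range
    print(f"{factor}x compression -> {usable_tb * factor:.0f} TB logical data per node")
```

So under those assumptions a single node holds roughly 48-80 TB of logical “curated” data.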
#visualization – Superb analysis of slopegraphs, one of Edward Tufte’s lesser-known ideas. [ed. It probably did not catch on because it is like reading a book while skiing – lots of jarring vertical eye movements to read text, for readers of horizontal scripts]
- Skiing data eye candy – serious skiers can see data on speed, vertical rate of descent, etc. in the corner of these goggles. The data can later be uploaded to a computer.
- No workload increase in a decade – “If you add up all the hours worked in the economy in June 2011 they are equal to all the hours worked in February of 1999”. Interesting data on the supply side of the economy.
- Watch out for the correlation attack – what is the relation between Wimbledon and the washing-machine repair business?
- Exa-iting – a single telescope aims to generate more data per day in 2020 than the entire internet generates today – an exabyte a day. But the universe is only about 4% actual ‘matter’ – so the data should compress well 😉