State of Data Last Week – Sep 12
September 12, 2010
<Cool Numbers> Things change.
- Top words in the most popular Tweets ever – Fish; Rob; Smile; like; just. Top_Tweet_Wordcloud
Top words in the first sentences of the most popular novels ever – Man; first; English; father; time. Top_Novel_Wordcloud.
- Worldwide data volume (2009) – 0.8 zettabytes. 2020 (projected) – 35 zettabytes. Surprise – Structured data overtaking unstructured in 2014.
- If you use pennies as flooring material, it will cost *you* about $2.50 per square foot. The US Mint spends about $4.20/sq ft to produce them, though. Compare that with about $5/sq ft for finished walnut!
- Disparity in Death – out of 698 obituaries published in New York Times this year, only 92 (13%) are of women.
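Two of the numbers above are easy to sanity-check. The sketch below is a rough back-of-the-envelope calculation, not the original sources' method: the 0.75-inch penny diameter and the square-grid layout are assumptions not stated in the post, and the function names are made up for illustration.

```python
# Back-of-the-envelope checks for the penny-floor and data-volume figures.
# Assumptions (not from the post): a US cent is 0.75 in across and the
# pennies sit on a simple square grid, one per 0.75 in x 0.75 in cell.

PENNY_DIAMETER_IN = 0.75


def pennies_per_sq_ft(diameter_in: float = PENNY_DIAMETER_IN) -> int:
    """Number of pennies on a square grid covering one square foot."""
    per_side = int(12 / diameter_in)  # pennies along a 12-inch edge
    return per_side * per_side


def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by two data points."""
    return (end / start) ** (1 / years) - 1


count = pennies_per_sq_ft()
face_value = count * 0.01             # dollars of pennies per square foot
mint_cost_per_penny = 4.20 / count    # implied by the $4.20/sq ft figure
growth = annual_growth_rate(0.8, 35.0, 2020 - 2009)

print(f"{count} pennies, ${face_value:.2f} per sq ft")
print(f"implied mint cost: {mint_cost_per_penny * 100:.2f} cents per penny")
print(f"implied data growth: {growth:.0%} per year")
```

Under these assumptions a square foot holds 256 pennies ($2.56), in line with the "about $2.50" figure, and going from 0.8 to 35 zettabytes in 11 years works out to roughly 41% growth per year.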
<Strategy/Arch> Google search parts ways with MapReduce: it “didn’t allow Google to update its index as quickly as it would like”.
Never before seen – Peek inside Microsoft’s large-scale online services (PDF; Hotmail, Bing, Hadoop). Bing servers are CPU-bound because they use “data compression on memory and disk data .. causing extra processing”. Google uses a home-grown “fast” compression algorithm that “minimizes number of shift operations during decompression”.
<Analysis> Next time someone gets lost in the wilderness, use Bayesian modeling to find them. For every minute a person goes unfound, the effective search radius grows by 50 m (for adults). (Danny Boyle’s next movie is about the true story of a climber trapped under a boulder in a Utah canyon.)
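To make the idea concrete, here is a minimal sketch of the standard Bayesian search update: after a cell is searched without success, its probability drops and the other cells' probabilities rise. The cell names, prior values, and detection probability are invented for illustration and are not taken from the linked article; the 50 m per minute rate is the figure quoted above, treated here as simple linear growth.

```python
def search_radius_m(minutes_missing: float, rate_m_per_min: float = 50.0) -> float:
    """Effective search radius in metres, growing linearly with time."""
    return minutes_missing * rate_m_per_min


def update_after_failed_search(prior: dict, searched: str, p_detect: float) -> dict:
    """Bayes update of location probabilities after an unsuccessful search.

    prior     -- probability that the person is in each cell (sums to 1)
    searched  -- the cell that was just searched without finding anyone
    p_detect  -- chance of finding the person if they are in that cell
    """
    # Probability that the search fails, over all possible locations.
    p_fail = 1.0 - prior[searched] * p_detect
    posterior = {}
    for cell, p in prior.items():
        if cell == searched:
            # Person is here but was missed.
            posterior[cell] = p * (1.0 - p_detect) / p_fail
        else:
            # Person is elsewhere, so the search was bound to fail.
            posterior[cell] = p / p_fail
    return posterior


# Hypothetical example: three cells, searchers sweep the most likely one.
prior = {"creek": 0.5, "ridge": 0.3, "trail": 0.2}
posterior = update_after_failed_search(prior, "creek", p_detect=0.8)

print(posterior)
print("radius after one hour:", search_radius_m(60), "m")
```

After one failed sweep of the creek, its probability falls from 0.5 to about 0.17, so the next sweep is better spent on the ridge.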
<Schema> Distributed caching example: load balance across nodes / partitions / shards without having to rebalance keys when the number of partitions changes (e.g., nodes are added).
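The linked example is not reproduced here. One common technique that fits this description is consistent hashing, sketched below under that assumption. The class name `HashRing`, the node names, and the choice of MD5 with 100 virtual points per node are all illustrative choices, not taken from the original.

```python
import bisect
import hashlib


def _point(label: str) -> int:
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual points per node."""

    def __init__(self, nodes=(), replicas: int = 100):
        self.replicas = replicas
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node owns several points so load spreads more evenly.
        for i in range(self.replicas):
            p = _point(f"{node}#{i}")
            self._owner[p] = node
            bisect.insort(self._points, p)

    def get_node(self, key: str) -> str:
        if not self._points:
            raise ValueError("ring has no nodes")
        # First point clockwise from the key, wrapping around at the end.
        i = bisect.bisect(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[i]]


keys = [f"user:{n}" for n in range(1000)]
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.get_node(k) for k in keys}

ring.add_node("cache-d")
after = {k: ring.get_node(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
print("all moved keys went to cache-d:", all(after[k] == "cache-d" for k in moved))
```

With plain `hash(key) % n`, going from three to four nodes would remap most keys. On the ring, only the keys that now fall to the new node change owner; everything else stays put.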
<Learning> Learn basic statistics from UK Parliament. Seriously good stuff.
Learn how Memcached works in 15 minutes – from a story.
3 reasons to use MongoDB – Simple Query (no join); Sharding; GridFS (storing actual files in DB).
<Visualization> Higher Education Economy – a bubble?
<Cocktail party cheat-sheet> 123 or 1000000000000000000000000000000000000 – which takes more bytes inside an Oracle record? It’s 123. Wonderful explanation here.
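A rough way to see why, assuming Oracle's documented NUMBER layout (one exponent byte plus one byte per base-100 digit of the mantissa, with trailing zeros dropped): the function below is a simplified estimate for positive integers only, its name is made up, and it ignores negative numbers, fractions, and precision limits. On a real database, `VSIZE` is the way to check.

```python
def oracle_number_bytes(n: int) -> int:
    """Estimated storage size of a positive integer as an Oracle NUMBER.

    Simplified model: 1 exponent byte + 1 byte per base-100 digit,
    after trailing base-100 zeros are stripped from the mantissa.
    """
    if n <= 0:
        raise ValueError("this sketch only handles positive integers")
    # Trailing pairs of zeros move into the exponent and cost nothing.
    while n % 100 == 0:
        n //= 100
    # Count the remaining base-100 digits.
    digits = 0
    while n > 0:
        n //= 100
        digits += 1
    return 1 + digits


small = 123
big = int("1000000000000000000000000000000000000")  # copied from the question above

print(small, "->", oracle_number_bytes(small), "bytes")
print(big, "->", oracle_number_bytes(big), "bytes")
```

In this model 123 needs two base-100 digits (1 and 23) plus the exponent byte, so 3 bytes, while a 1 followed by zeros keeps a single mantissa digit and needs only 2.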