State of Data Last Week – Sep 12

<Cool Numbers> Things change.

  • Top words in the most popular Tweets ever – Fish; Rob; Smile; like; just. Top_Tweet_Wordcloud

    Top words in the first sentences of the most popular novels ever – Man; first; English; father; time. Top_Novel_Wordcloud.

  • Worldwide data volume (2009) – 0.8 zettabytes. 2020 (projected) – 35 zettabytes. Surprise – Structured data overtaking unstructured in 2014.
  • If you use pennies as flooring material, it will cost *you* about $2.50 per square foot. The US Mint spends about $4.20/sq ft to produce them, though. Compare that with about $5/sq ft for finished walnut!
  • Disparity in Death – out of 698 obituaries published in The New York Times this year, only 92 (13%) are of women.
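
The penny-flooring figure in the list above can be sanity-checked with a little arithmetic. The sketch below is a rough check, not a quote from any source: it assumes a US cent is 0.75 inch across and that the coins are laid in a simple square grid. The per-coin production cost is back-derived from the $4.20 per square foot figure rather than taken from a Mint report, and all function names are made up for illustration.

```python
import math

PENNY_DIAMETER_IN = 0.75  # assumed diameter of a US cent, in inches
FOOT_IN = 12.0            # inches per foot


def pennies_per_square_foot() -> int:
    """Count the pennies covering one square foot in a simple square grid."""
    per_side = math.floor(FOOT_IN / PENNY_DIAMETER_IN)
    return per_side * per_side


def face_value_per_square_foot() -> float:
    """Face value, in dollars, of the pennies covering one square foot."""
    return pennies_per_square_foot() * 0.01


def implied_cost_per_penny(mint_cost_per_sq_ft: float) -> float:
    """Production cost per coin, in dollars, implied by a cost per square foot."""
    return mint_cost_per_sq_ft / pennies_per_square_foot()


if __name__ == "__main__":
    print("pennies per sq ft:", pennies_per_square_foot())
    print("face value per sq ft: $%.2f" % face_value_per_square_foot())
    print("implied cost per penny: %.2f cents" % (implied_cost_per_penny(4.20) * 100))
```

Under these assumptions a square foot holds 256 pennies, or $2.56 in face value, which is in line with the "about $2.50" above. A tighter hexagonal layout would fit more coins and cost a little more.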

<Strategy/Arch> Google search moves away from MapReduce – it “didn’t allow Google to update its index as quickly as it would like”.

Never before seen – Peek inside Microsoft’s large scale online services (PDF; Hotmail, Bing, Hadoop). Bing servers are CPU-bound because Bing uses “data compression on memory and disk data .. causing extra processing”. Google uses a home-grown “fast” compression algorithm that “minimizes number of shift operations during decompression”.

<Analysis> Next time someone gets lost in the wilderness, use Bayesian modeling to find them. With every minute a person is not found, the effective search radius increases by 50 m (for adults). (Danny Boyle’s next movie is about the true story of a climber trapped under a boulder in a Utah canyon.)

Driver’s ed. class makes Indiana teenagers 4x more accident-prone! Why? Hint.

<Big Data> “Google Instant” search increased server load 5-7x.

<Schema> A distributed caching example – load balance across nodes / partitions / shards without having to rebalance most keys when the number of partitions changes (e.g., when nodes are added).
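
One common way to get this property is consistent hashing, and the sketch below shows the idea. Whether the linked example uses exactly this scheme is an assumption. The class name, the node names, the use of MD5, and the 100 virtual points per node are all illustrative choices, not taken from the linked article.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal hash ring: each node owns several virtual points on a circle."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per node (illustrative value)
        self._ring = []           # sorted list of (point_hash, node_name)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        # MD5 is used only to spread points evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring is empty")
        # First point at or after the key's hash, wrapping around the circle.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


def moved_keys(nodes, new_node, keys):
    """Return {key: (old_node, new_node)} for keys remapped by adding a node."""
    ring = ConsistentHashRing(nodes)
    before = {k: ring.get_node(k) for k in keys}
    ring.add_node(new_node)
    after = {k: ring.get_node(k) for k in keys}
    return {k: (before[k], after[k]) for k in keys if before[k] != after[k]}


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(1000)]
    moved = moved_keys(["cache-a", "cache-b", "cache-c"], "cache-d", keys)
    print(f"{len(moved)} of {len(keys)} keys moved after adding one node")
```

With a plain `hash(key) % number_of_nodes` scheme, adding a node would remap most keys. On the ring, the only keys that move are the ones the new node takes over, and the rest stay where they were.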

<Learning> Learn basic statistics from UK Parliament. Seriously good stuff.

Learn how Memcached works in 15 minutes – from a story.

3 reasons to use MongoDB – Simple Query (no join); Sharding; GridFS (storing actual files in DB).

<Visualization> Higher Education Economy – a bubble?

<Cocktail party cheat-sheet> 123 or 1000000000000000000000000000000000000 – which takes more bytes inside an Oracle record? It’s 123. Wonderful explanation here.
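
A rough way to see why: the sketch below assumes the commonly described internal layout of an Oracle NUMBER, which is one sign/exponent byte followed by one byte per base-100 digit, with trailing base-100 zeros dropped. It covers positive integers only and ignores Oracle's precision limit. The function name is hypothetical, and the sketch is not taken from the linked explanation.

```python
def oracle_number_bytes(n: int) -> int:
    """Estimate the stored length in bytes of a positive integer in a NUMBER column.

    Assumed layout: 1 sign/exponent byte + 1 byte per significant base-100 digit.
    """
    if n <= 0:
        raise ValueError("this sketch only covers positive integers")

    # Trailing base-100 zeros are carried by the exponent, not stored as digits.
    while n % 100 == 0:
        n //= 100

    # Count the remaining base-100 digits (each holds two decimal digits).
    digits = 0
    while n > 0:
        n //= 100
        digits += 1

    return 1 + digits


if __name__ == "__main__":
    print("123   ->", oracle_number_bytes(123), "bytes")
    print("10^36 ->", oracle_number_bytes(10 ** 36), "bytes")
```

Under this assumption, 123 needs two base-100 digits (1 and 23) plus the exponent byte, so three bytes. A 1 followed by a long run of zeros needs a single digit plus the exponent byte, so two bytes. On a real database, `VSIZE` on the column would be the way to confirm this.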


About Nilendu Misra
I love to learn, create and coach. Things that I do well are - Communicating ideas - verbally or through words and diagrams; Problem Solving - Logical or Abstract; Very Large Scale Systems; think about 'Frighteningly Simple' approach first. Things that I intend to do better are - Establishing Stringent Process; Exchanging Tough Feedback; Keeping up with my reading or To-Do list to be able to completely relax.
