State of Data Last Week – Sep 26

<Cool Numbers> Why performance, caching (and maybe color!) matter 5x more today than in 2003 – even for the environment.

  • “From 2003 to 2009 the average web page grew from 93.7K to over 507K” (5x)
  • “top 500 home page goes from 507K and 64.7 requests upon initial cache-cleared load to 98.5K and 16.1 requests. On average, caching saves 81 percent of bytes, and 75 percent of the requests.”
  • And how green is it? Viewing a simple web page generates ~20 milligrams of CO2 per second. That rises to 300 mg of CO2 per second for a page with video.
  • A black version of Google would save about 750 megawatt-hours – enough to power 1000 houses for a year.
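
A quick sanity check of the numbers above, as a minimal sketch. It only re-derives the quoted percentages from the quoted sizes and request counts; the variable names are made up for illustration, and reading the second pair of figures as a cached repeat load is my interpretation of the quote.

```python
# Figures quoted in the bullets above (sizes in KB, requests per page load).
PAGE_KB_2003 = 93.7
PAGE_KB_2009 = 507.0
COLD = {"kb": 507.0, "requests": 64.7}  # initial cache-cleared load
WARM = {"kb": 98.5, "requests": 16.1}   # repeat load with a cache (my reading of the quote)


def percent_saved(before: float, after: float) -> float:
    """Share of `before` that is avoided when only `after` is needed."""
    return 100.0 * (before - after) / before


growth = PAGE_KB_2009 / PAGE_KB_2003
bytes_saved = percent_saved(COLD["kb"], WARM["kb"])
requests_saved = percent_saved(COLD["requests"], WARM["requests"])

print(f"page growth 2003-2009: {growth:.1f}x")
print(f"bytes saved by caching: {bytes_saved:.0f}%")
print(f"requests saved by caching: {requests_saved:.0f}%")
```

The results line up with the quoted 81 percent and 75 percent, and the page growth comes out a little over 5x.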

<Analysis> If you’re 25, you probably listen to a lot of …Primus? Listening data was analyzed to learn how preferences vary with age and gender.

Why you should look beyond the typical “Insights from Top 10” and adopt “Weighted Sort”. Avinash is also the author of two best-selling (and ultra great) books on Web Analytics.

<Data Outage> Persistent cache-database lookups under a race condition caused a Facebook outage for 1/10th of the known universe this week.

<Strategy/Arch> Your browsing data could be ever-persistent, thanks to HTML5 – cookies that would never, ever go away.

<Learning> Finally! Powerful OCR-enabled search within video lectures. TalkMiner runs OCR whenever slides are shown in video talks, and adds the slide content and the time it appears to the search metadata.

<Big Data> Facebook’s data center bill is about $50M/year. Here’s a great cost analysis and an Excel-based cost model – 57% of the cost is from servers (amortized over 3 years). Should someone be brave enough to buck the trend and run fewer servers, then?

<Visualization> After burglars (see last week’s), folks buying weed are joining the data revolution. What is marijuana really worth – apparently a real-time mash-up of price data.

<Cocktail party cheat-sheet> Yes, it IS a legal requirement in the UK (for VAT-registered businesses) to generate invoice numbers without gaps, i.e., your data store can never “lose” a sequence / auto-generated number. Why? It is harder to hide revenue from the tax authorities when no invoice can go “missing”.
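
Why is this a data-store problem? Typical sequences and auto-increment columns hand out a number even if the transaction later rolls back, which leaves a gap. One common way around it (a sketch only, not any tax authority's or vendor's prescribed method) is to bump a counter row inside the same transaction that inserts the invoice, so a rollback gives the number back. The table names, column names and `create_invoice` helper below are hypothetical, and SQLite is used just to keep the example self-contained.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    # isolation_level=None: we issue BEGIN / COMMIT / ROLLBACK ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE invoice_counter ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"
        " last_no INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO invoice_counter VALUES (1, 0)")
    conn.execute(
        "CREATE TABLE invoice ("
        " invoice_no INTEGER PRIMARY KEY,"
        " customer TEXT NOT NULL,"
        " amount REAL NOT NULL CHECK (amount > 0))"
    )
    return conn


def create_invoice(conn: sqlite3.Connection, customer: str, amount: float) -> int:
    """Allocate the next number and store the invoice atomically."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        conn.execute("UPDATE invoice_counter SET last_no = last_no + 1 WHERE id = 1")
        (number,) = conn.execute(
            "SELECT last_no FROM invoice_counter WHERE id = 1"
        ).fetchone()
        conn.execute("INSERT INTO invoice VALUES (?, ?, ?)", (number, customer, amount))
        conn.execute("COMMIT")
        return number
    except Exception:
        conn.execute("ROLLBACK")  # the counter bump is undone too, so no gap
        raise


store = open_store()
first = create_invoice(store, "Alice", 120.0)
try:
    create_invoice(store, "Bob", -5.0)  # violates the CHECK, so it rolls back
except sqlite3.IntegrityError:
    pass
second = create_invoice(store, "Carol", 80.0)
```

The price of this approach is that invoice creation is serialized on the counter row, which is the trade-off sequences were designed to avoid.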

State of Data Last Week – Sep 19

<Cool Numbers> Not all inches are the same! And, burglars are getting better at Data analysis.

  • (36 inches of Old Navy) MINUS (36 inches of GAP) = 2 inches!
  • Finally, coming to some use! Burglars steal $200,000 mining Facebook data. Simple – they were just monitoring “Going to have a great time away this weekend” updates!
  • Charles Trippy, 26, is the top YouTube celebrity, with his “Internet Killed Television” vlog / reality show drawing over 160M views. He endorses Gillette, among other brands.
  • Probability that a five-letter pronounceable domain is available – < 8%

<Data Outage> Cost of the JP Morgan Chase Oracle outage (suspected to be caused by EMC hardware) – $132M in ACH transfers held up; 1000 auto & student loan applications lost.

<Analysis> Data Analysis Briefbook from CERN – the perfect place to quickly look up “Markov Chain” or “Bayesian Statistics” in 2 minutes.

For the 2nd time, Google fires an engineer for unauthorized access to customer data.

Finally! A 128GB SSD for your laptop at about $2/GB, with monstrous performance.

Want noSQL-like heavy-write performance from a (transactional) Oracle system? Evaluate the ‘batch,nowait’ mode, which changes how committed data is written. Strictly NOT for production – it may cause data loss (just like some noSQL stores would).

<Big Data> Michael Stonebraker, who has created more databases than most people have written lines of code, thinks that this time (SciDB) he got it right.

How, and why, Google bought an old paper mill in Finland, spent $260M and converted it into a mega data center that still looks like a paper mill from the outside.

Facebook creates an awesome utility for Online Schema Change in MySQL. The source code is open!

> Every country is best at something. Costa Rica? Happiness. Cyprus? Kidney Transplants!

<Cocktail party cheat-sheet> Sequel or S-Q-L? 52% think it’s the former.

State of Data Last Week – Sep 12

<Cool Numbers> Things change.

  • Top words in the most popular Tweets ever – Fish; Rob; Smile; like; just.

    Top words in the first sentences of the most popular novels ever – Man; first; English; father; time.

  • Worldwide data volume (2009) – 0.8 zettabytes. 2020 (projected) – 35 zettabytes. Surprise – structured data overtaking unstructured in 2014.
  • If you use pennies as flooring material, it will cost *you* about $2.50 per square ft. The US Mint spends about $4.20/sf to produce them, though. Compare that with about $5/sf for finished walnut!
  • Disparity in Death – out of 698 obituaries published in New York Times this year, only 92 (13%) are of women.
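
Two of the numbers above are easy to re-derive, so here is a back-of-the-envelope sketch. Assumptions of mine, not from the linked sources: the data-volume projection spans 11 years (2009 to 2020) with steady compound growth, a penny is 0.75 inch across, and the coins are laid on a plain square grid.

```python
import math

# Data volume: 0.8 ZB (2009) to a projected 35 ZB (2020), i.e. 11 years.
start_zb, end_zb, years = 0.8, 35.0, 2020 - 2009
growth_factor = end_zb / start_zb
annual_growth = growth_factor ** (1 / years) - 1  # compound annual growth rate

# Penny floor. Assumptions: 0.75 in coin diameter, simple square grid.
PENNY_DIAMETER_IN = 0.75
pennies_per_side = math.floor(12 / PENNY_DIAMETER_IN)
pennies_per_sqft = pennies_per_side ** 2
face_value_per_sqft = pennies_per_sqft * 0.01
# Implied production cost per coin if a square foot costs the Mint $4.20.
mint_cost_per_penny = 4.20 / pennies_per_sqft

print(f"data grows {growth_factor:.1f}x, roughly {annual_growth:.0%} per year")
print(f"{pennies_per_sqft} pennies per sq ft = ${face_value_per_sqft:.2f} face value")
print(f"implied mint cost: {mint_cost_per_penny * 100:.2f} cents per penny")
```

Under these assumptions the floor works out to 256 pennies per square foot, which is in the same ballpark as the quoted $2.50, and the data projection implies growth of roughly 40% a year.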

<Strategy/Arch> Google search splits with MapReduce — “didn’t allow Google to update its index as quickly as it would like”.

Never before seen – a peek inside Microsoft’s large scale online services (PDF; Hotmail, Bing, Hadoop). Bing servers are CPU-bound because they use “data compression on memory and disk data .. causing extra processing”. Google uses a home-grown “fast” compression algorithm that “minimizes number of shift operations during decompression”.

<Analysis> Next time someone gets lost in the wilderness, use Bayesian modeling to find them. With every minute a person is not found, the effective search radius increases by 50m (for adults). (Danny Boyle’s next movie is about the true story of a climber trapped under a boulder in a Utah canyon.)
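
For the curious, the core Bayesian step is small enough to sketch. This is a textbook-style search-theory update, not the model from the linked analysis: the area names, the prior probabilities and the 80% detection chance are invented for illustration. The only figure taken from the item above is the 50m-per-minute radius growth.

```python
import math


def update_after_failed_search(prior, searched, p_detect):
    """Bayes update after searching one area and NOT finding the person.

    prior:    dict of area -> probability the person is there (sums to 1)
    searched: the area that was just searched
    p_detect: chance of spotting the person if they are in that area
    """
    # Probability of coming back empty-handed from this search.
    p_miss = 1.0 - prior[searched] * p_detect
    posterior = {}
    for area, p in prior.items():
        if area == searched:
            p = p * (1.0 - p_detect)  # still possible, just less likely
        posterior[area] = p / p_miss
    return posterior


def search_area_km2(minutes_missing, metres_per_minute=50.0):
    """Area of the circle the person could have reached, per the 50m/min figure."""
    radius_km = metres_per_minute * minutes_missing / 1000.0
    return math.pi * radius_km ** 2


# Hypothetical areas and prior beliefs.
prior = {"ridge": 0.5, "creek": 0.3, "trail": 0.2}
posterior = update_after_failed_search(prior, "ridge", p_detect=0.8)
```

After a thorough but unsuccessful look at the most likely area, the probability mass shifts to the areas not yet searched, which tells the team where to go next. The area function shows why speed matters: the region to cover grows with the square of the time elapsed.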

Driver’s ed. class makes Indiana teenagers 4x more accident prone! Why? Hint.

<Big Data> “Google Instant” search increased server load 5-7x.

<Schema> A distributed caching example: load-balance across nodes / partitions / shards without having to rebalance keys when the number of partitions changes (e.g., nodes are added).
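
One well-known technique that fits this description is consistent hashing. I am assuming that is what the linked example uses, so treat the following as a minimal sketch of the general idea rather than that article's code. The `HashRing` class, the node names and the choice of MD5 with 100 virtual points per node are all illustrative.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per physical node
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a node")
```

Strictly speaking, keys are not left completely alone: the ones that now fall before one of the new node's points move to the new node, and everything else stays put. Compare that with `hash(key) % node_count`, where changing the node count reshuffles most keys.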

<Learning> Learn basic statistics from UK Parliament. Seriously good stuff.

Learn how Memcached works in 15 minutes – from a story.

3 reasons to use MongoDB – Simple Query (no join); Sharding; GridFS (storing actual files in DB).

<Visualization> Higher Education Economy – a bubble?

<Cocktail party cheat-sheet> 123 or 1000000000000000000000000000000000000 – which takes more bytes inside an Oracle record? It’s 123. Wonderful explanation here.
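
The short version, as I understand Oracle's NUMBER format: a value is stored as one exponent byte plus one byte per base-100 digit, and trailing zero digits are not stored. The sketch below only handles positive integers and only computes the length (it does not produce the actual bytes); the function name is made up, and `10 ** 36` stands in for the big number above, since any power of ten behaves the same way.

```python
def oracle_number_size(n: int) -> int:
    """Estimated stored length in bytes of a positive integer in an Oracle NUMBER.

    Sketch under stated assumptions: 1 exponent byte + 1 byte per base-100
    digit, with trailing zero digits dropped. Zero and negatives are not handled.
    """
    if n <= 0:
        raise ValueError("this sketch only covers positive integers")
    digits = []  # base-100 digits, least significant first
    while n:
        n, digit = divmod(n, 100)
        digits.append(digit)
    # Trailing zeros of the number are the leading entries of this list.
    trailing_zeros = 0
    while digits[trailing_zeros] == 0:
        trailing_zeros += 1
    return 1 + len(digits) - trailing_zeros


small = oracle_number_size(123)     # base-100 digits: 1, 23
big = oracle_number_size(10 ** 36)  # a single significant base-100 digit
print(small, big)
```

So 123 needs two mantissa bytes while the huge round number needs only one, because the size depends on significant digits rather than magnitude.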

State of Data Last Week – Sep 05

<Cool Numbers> What do you like to eat? Look at your phone!

<Strategy> From MIT: the best-ever Data Management checklist. Every project should have this, or a similar one, validated at the start!

<Analysis> “median Fortune 1000 company could increase its revenue by $2.01 billion a year just by marginally improving the usability of the data already at its disposal”

Start towards your first million by tracking these 6 metrics for your Web App. Everything is laid out in this excellent spreadsheet model for easy math! (Note: FreshBooks follows this model.)

<Big Data> Pomegranate stores billions of tiny little files – no SPOF using ‘distributed extensible hash table’.

How CERN (LHC) manages 20PB of data in JBOD (Just a Bunch Of Disks) hooked up to Linux boxes

<Schema> How can you represent inheritance in a SQL server (RDBMS) database?
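
One of the usual answers is "class table inheritance": a base table for the shared columns and one table per subtype whose primary key is also a foreign key to the base. The sketch below shows that single pattern (single-table and table-per-concrete-class are the other common options). The vehicle / car / truck schema is invented for illustration, and SQLite stands in for whatever RDBMS you use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript(
    """
    -- Base "class": columns shared by every vehicle.
    CREATE TABLE vehicle (
        id    INTEGER PRIMARY KEY,
        maker TEXT NOT NULL
    );
    -- "Subclasses": the primary key doubles as a foreign key to the base row.
    CREATE TABLE car (
        vehicle_id INTEGER PRIMARY KEY REFERENCES vehicle(id),
        doors      INTEGER NOT NULL
    );
    CREATE TABLE truck (
        vehicle_id INTEGER PRIMARY KEY REFERENCES vehicle(id),
        payload_kg INTEGER NOT NULL
    );
    INSERT INTO vehicle VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO car     VALUES (1, 4);
    INSERT INTO truck   VALUES (2, 12000);
    """
)

# Reading a "Car object" means joining the base row with its subtype row.
cars = conn.execute(
    "SELECT v.id, v.maker, c.doors"
    " FROM vehicle v JOIN car c ON c.vehicle_id = v.id"
).fetchall()

# A subtype row cannot exist without its base row.
try:
    conn.execute("INSERT INTO truck VALUES (99, 500)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The trade-off: no NULL-filled columns and clean constraints per subtype, at the cost of a join on every read.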

<Learning> While you’re at it, check out the book ‘SQL Antipatterns’. If you liked Martin Fowler, you’ll love this as well. It’s been out for less than a couple of months!

The problems with ACID, and how to fix them without going noSQL – Daniel Abadi proposes lock avoidance to fix the problem (Actual Paper – PDF)

<Visualization> 500 years of scientific progress as a subway map. Now science has a timeline, thanks to data.

<Cocktail party cheat-sheet> Heavy drinkers outlive abstainers! “even after adjusting for all covariates, abstainers and heavy drinkers continued to show increased mortality risks of 51 and 45%, respectively, compared to moderate drinkers.”