State of Data Last Week – #37

#analysis – Why Travel Sites kick tires so much better – the biggest advantage for travel sites was "prices vs. the same products offline": 40% reported that online travel sites offer better prices than do offline locations, vs. 10% for non-travel sites

#architecture –
Death Match! Cloud vs. Hosted SSD – the MySQL Chief Performance Architect concludes that, put simply, price is in the same order of magnitude, but performance differs by two to three orders of magnitude

#big_data –
A really powerful server at 0% load still consumes 32% of its peak power. Land is <2% of data center costs; mechanical & electrical are 70-85%. See this Data Center Energy report from Microsoft (PDF) on ‘why the industry is nuts to focus on density’.

#DBMS – Can a ‘UNION’ be converted to a ‘FULL OUTER JOIN’? For the cases where it can, see how it’s done.

#learning – ‘Encyclopedia of Machine Learning’ is now available online / for download, FREE.

#visualization – ‘The UX of Data’ – device independence is just one UX outcome of data in the cloud; at some level the ‘devices’ themselves start generating more data, leading to a Great Barrier Reef-like self-sustaining ecosystem


11 Condensed Lessons from Moving Big Data for a Quarter of an Adult Lifetime

  1. The best practices below solve about 25% of the problems. The remaining 75% is always around the code and platform (database/OS/hardware/IO/memory, etc.). So, prepare a lot of firepower downstream.
  2. Data movement is like moving homes. Application data management practices are like bringing groceries home. Each paradigm fails in the other’s territory; they’re mostly orthogonal.
  3. Batch, batch, batch – never migrate data atomically (one by one). Atomic works great when incoming data needs to be validated. When migrating existing data, a “whole shebang” approach is mandatory, as the data is already “proved”.
  4. To extend paradigm #3 –
    • Try migrating with the app offline
    • Try disabling constraints
    • Try dropping and recreating indexes rather than building them organically along with the data movement
    • Disable triggers – the triggers have already worked on the data
    • If you have to batch a logical entity in chunks, try not to commit after every batch. Commit as rarely as you can. The best number is one commit per migration – a good number is no more than one every 15 minutes
  5. Don’t use application code for your largest entities (say, audit or archive data). Use custom-written thin scripts.
  6. If relational databases are in play, minimize redo and other logging. There are multiple ways of doing so.
  7. Know which 10% of your structures (say, tables) contain 90% of the data (say, 3 tables contain 140GB while the remaining 77 contain 10GB). That’s a typical enterprise pattern. Now “Ctrl+F” the code base for the former and optimize.
  8. Have an end goal for migration/movement time and continuously try to hit it. Stick it on your cube wall. When we did QBDT data sync, it originally took 8+ hours. Our BHAG was “< 1 minute for 98% of companies”. We achieved something smart because we declared a (then) stupid target.
  9. Tune for IO. The impact of faster IO is non-linear: a 15% faster IO device (or 15% more RAM) can buy you hours of savings.
  10. Parallelize only the processes from #7. Parallelization has (generally) created more performance problems than it has solved.
  11. Remember #1; trace a baseline migration and get it analyzed by a “systems thinker type” who also understands your underlying technology (e.g., Oracle / MySQL / Hadoop / Netezza).
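Lessons #3 and #4 boil down to moving rows in large batches with very few commits. Here is a minimal sketch with SQLite as a stand-in target (the table name and batch size are made up; against a real disk-backed database, the commit cadence is where the hours go):

```python
import sqlite3

def migrate(rows, batch_size, conn):
    """Move rows in batches, committing once per batch rather than
    once per row (lesson #4: commit as rarely as you can)."""
    conn.execute("CREATE TABLE IF NOT EXISTS audit (id INTEGER, payload TEXT)")
    for i in range(0, len(rows), batch_size):
        conn.executemany("INSERT INTO audit VALUES (?, ?)",
                         rows[i:i + batch_size])
        conn.commit()  # one commit per batch, not one per row

conn = sqlite3.connect(":memory:")
rows = [(i, "x" * 100) for i in range(10_000)]
migrate(rows, batch_size=2_500, conn=conn)  # 4 commits instead of 10,000
count = conn.execute("SELECT COUNT(*) FROM audit").fetchone()[0]
print(count)  # 10000
```

On a durable database, each commit forces a flush to disk, so collapsing 10,000 commits into 4 (or 1) is exactly the non-linear win lesson #9 describes.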

‘Explain Scalability in 90 sec’

A really scary alien holds me hostage and wants to know what this ‘scalability’ thing is. I have 90 seconds. I have to take a shot at squeezing 15 years of field lessons, intuitively enough, into those 90 seconds to save humanity. Let’s try –

‘A tiny little ant can carry 30 times its own weight. Now scale the ant proportionally up to the weight of an average human, about 150 lbs. At that scale it can carry only about 30 lbs, less than half of what an average human of the same weight can carry.

Systems function fundamentally differently at scale. It is never linear, and it depends on multiple internal and external factors. In the ant’s case, for example, the factors range from the anatomy of its legs, the curvature of its back, and gravity, to the little sand particle that may have gotten stuck under its belly (and be wrongly inflated to 56 lbs when the ant scales up), to whether the ant remains motivated to carry heavy things or hits a fast-food place first thing after.

Understanding this multiplicity and non-linearity – really understanding it well – is half of solving scalability. The other half is common sense: not trying to steal ‘ant’ patterns when building an ‘owl’ system, or vice versa.

Now, oh Honorable Alien, since no one else can replace us humans – please spare us for now.’
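The ant arithmetic above is essentially the square-cube law: strength tracks muscle cross-section (length squared) while weight tracks volume (length cubed). A back-of-the-envelope sketch, where the ant’s mass and the 30x strength figure are rough assumptions, lands in the same ballpark as the speech:

```python
# Square-cube law sketch: strength ~ L^2 (muscle cross-section),
# weight ~ L^3 (volume), so strength-to-weight falls as size grows.
ant_mass_lb = 0.00001      # ~5 mg, a rough assumption
ant_carry_ratio = 30       # carries ~30x its own weight
human_mass_lb = 150.0

# Linear scale factor needed to reach human weight (mass ~ L^3).
L = (human_mass_lb / ant_mass_lb) ** (1 / 3)

# Strength scales with L^2 while weight scaled with L^3.
scaled_carry_lb = ant_carry_ratio * ant_mass_lb * L ** 2
scaled_ratio = scaled_carry_lb / human_mass_lb

print(f"scaled-up ant carries ~{scaled_carry_lb:.0f} lb "
      f"({scaled_ratio:.2f}x its weight, down from 30x)")
```

The exact pounds depend on the assumed ant mass, but the direction never changes: the scaled-up ant is dramatically weaker relative to its weight, which is the whole point about systems at scale.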

State of Data Last Week – #36

#Inside Intuit – Lots of Data job openings at Intuit – please refer your data contacts to work at one of the “best places to work” in the U.S.

Req # Title
70676 Sr. Web Developer
70952 Staff Software Engineer Data
71010 Senior Software Engineer Data Warehousing
71012 Data Architect
71058 Senior Scientist: Analytics
71059 Senior Scientist: Econometrics
71061 Senior Scientist: Data Mining
71115 Data Warehouse Engineer
71150 Data Scientist

#analysis – Watson, an 80-teraflop machine, wins Jeopardy! against the best of humans, but also asserted that Toronto is a U.S. city. Here’s a fascinating white paper from IBM Lab / Stanford highlighting the journey (PDF) with Machine Learning, Natural Language Processing, and Information Retrieval.

#architecture –
A Data Management Architectural Pattern behind interactive web applications using Hadoop with Membase, serving ‘sub-millisecond latency’.


290,000 people played CityVille in its first 24 hours!

#big_data – Scale-Up vs. Scale-Out – what happens when you pitch Watson (a very large box) against Google (a cloud grid) for Jeopardy?

#DBMS – Some old-school, reliable fun calculating overlapping hours for Labor Transactions – from defining the problem and designing tables to writing elegant SQL.

#learning – ‘What I do with data at 37Signals’ – answering ‘why people cancel, how they interact with signup pages, etc.’ and what tools get used (including an HP-12C)

#visualization – Trulia created an interactive map of ‘Rent vs. Buy’ in America’s 50 largest cities with associated Job Growth, Foreclosure and Employment data.




Performance Optimization – Ben Franklin Way

‘In 1745, Franklin pondered why ships sailing from America to England had quicker voyages than returning ships. Over the next ten years he gathered time differentials from various ships, and reasons for the variance. Most mentioned a kind of current.

He wrote in 1767 that a voyager may know when he is in the Gulf Stream by the warmth of the water. He was the first to call this eastward current the “Gulf Stream”, and he had a map printed plotting its course.’

Wit and Wisdom of Benjamin Franklin

Performance optimization, the Benjamin Franklin way, thus is –

  1. Discover and define the problem well – pit what works “fast” against what goes “slow” (going to England is fast; coming back to the U.S. is slow)
  2. Measure for a reasonable length of time – Ben did it for 10 years!
  3. Ask the ship captains (end users) – they may provide the clue, e.g., “when I load Chrome and IE together, my Outlook stalls”
  4. Document the solution, ideally visually
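Step 2 (“measure”) translates directly to code: sample both the fast and the slow path many times and compare medians rather than trusting a single run. A minimal sketch, where fast_path and slow_path are hypothetical stand-ins for the two voyages:

```python
import time
import statistics

def measure(fn, runs=50):
    """Time fn over many runs and return the median sample.
    Ben measured for 10 years; we can at least take more than one sample."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Hypothetical stand-ins for 'to England' (fast) vs 'back to the U.S.' (slow).
def fast_path():
    sum(range(1_000))

def slow_path():
    sum(range(100_000))

fast, slow = measure(fast_path), measure(slow_path)
print(f"fast path median: {fast:.6f}s, slow path median: {slow:.6f}s")
```

The differential between the two medians is the modern equivalent of Franklin’s time differentials; the “current” is whatever explains it.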



Learnt Today: Evidence – Feeling = Science

‘The only way to have real success in science, the field I’m familiar with, is to describe the evidence very carefully without regard to the way you feel it should be. If you have a theory, you must try to explain what’s good and what’s bad about it equally.’

– Richard P. Feynman in ‘What Do You Care What Other People Think?’


Lesson Learnt – On importance of ‘Initial Design’

“By far the biggest lesson I learned from the 737 was never to take an initial design configuration as a given. It’s human nature to do just that and go charging ahead to work within an existing framework. However, that doesn’t necessarily lead to great airplanes”.

– Joe Sutter in 747