11 Condensed Lessons from Moving Big Data for Quarter of Adult Lifetime

  1. Best Practices below would solve for about 25% of the problems. Rest 75% is always around the code, platform (database/os/hardware/IO/memory etc). So, prepare a lot of firepower downstream.
  2. Data movement is like moving homes. Application data management practices are like bringing grocery home. One paradigm fails elsewhere. They’re mostly orthogonal.
  3. Batch, batch, batch – never migrate data atomically (one-by-one). Atomic works great when incoming data needs to be validated. When migrating existing data a “whole shebang” approach is mandatory as data is already “proved”.
  4. To extend the paradigm#3 –
    • Try migrate taking app offline
    • Try disable constraint
    • Try drop/recreate indexes rather than organically built it along with data movement
    • Disable triggers – the triggers have already worked on the data
    • If you have to batch a logical entity in chunks, try not to commit after every batch size. Try commit as less as you can. The best number is 1/migration – a good number is no more than 1 every 15 minutes
  5. Don’t use application code for your largest entities (say, audit or archive data). Use custom-written thin scripts.
  6. If relational databases are in play, minimize redo and other logging. There’re multiple ways of doing so.
  7. Know what 10% of your structures (say, tables) contain 90% of the data (say, 3 tables contain 140GB, rest 77 contain 10GB). That’s typically enterprise pattern. Now, “CTRL+F” in code base for the former and optimize.
  8. Have an end-goal in target for migration / movement time and continuously try hitting that goal. Stick it to your cube wall. When we did QBDT data sync, it originally took 8 hrs+. Our BHAG was “< 1 minute for 98% companies”. We achieved something smart because we declared a (then) stupid target.
  9. Tune for IO. Impact of faster IO is non-linear. A 15% faster IO device (or 15% more RAM) would buy you hours of saving.
  10. Parallelize only for processes on #7. Parallelization has created more performance problems than it solved (generally).
  11. Remember #1; trace a baseline migration and get it analyzed by a “systems thinker type of guy” who also understands your underlying technology (e.g., Oracle / mySQL / Hadoop /NetEzza).

About Nilendu Misra
I love to learn, create and coach. Things that I do well are - Communicating ideas - verbally or through words and diagrams; Problem Solving - Logical or Abstract; Very Large Scale Systems; think about 'Frighteningly Simple' approach first. Things that I intend to do better are - Establishing Stringent Process; Exchanging Tough Feedback; Keeping up with my reading or To-Do list to be able to completely relax.

Comments are closed.

%d bloggers like this: