11 Condensed Lessons from Moving Big Data for Quarter of Adult Lifetime
February 23, 2011
- Best Practices below would solve for about 25% of the problems. Rest 75% is always around the code, platform (database/os/hardware/IO/memory etc). So, prepare a lot of firepower downstream.
- Data movement is like moving homes. Application data management practices are like bringing grocery home. One paradigm fails elsewhere. They’re mostly orthogonal.
- Batch, batch, batch – never migrate data atomically (one-by-one). Atomic works great when incoming data needs to be validated. When migrating existing data a “whole shebang” approach is mandatory as data is already “proved”.
- To extend the paradigm#3 –
- Try migrate taking app offline
- Try disable constraint
- Try drop/recreate indexes rather than organically built it along with data movement
- Disable triggers – the triggers have already worked on the data
- If you have to batch a logical entity in chunks, try not to commit after every batch size. Try commit as less as you can. The best number is 1/migration – a good number is no more than 1 every 15 minutes
- Don’t use application code for your largest entities (say, audit or archive data). Use custom-written thin scripts.
- If relational databases are in play, minimize redo and other logging. There’re multiple ways of doing so.
- Know what 10% of your structures (say, tables) contain 90% of the data (say, 3 tables contain 140GB, rest 77 contain 10GB). That’s typically enterprise pattern. Now, “CTRL+F” in code base for the former and optimize.
- Have an end-goal in target for migration / movement time and continuously try hitting that goal. Stick it to your cube wall. When we did QBDT data sync, it originally took 8 hrs+. Our BHAG was “< 1 minute for 98% companies”. We achieved something smart because we declared a (then) stupid target.
- Tune for IO. Impact of faster IO is non-linear. A 15% faster IO device (or 15% more RAM) would buy you hours of saving.
- Parallelize only for processes on #7. Parallelization has created more performance problems than it solved (generally).
- Remember #1; trace a baseline migration and get it analyzed by a “systems thinker type of guy” who also understands your underlying technology (e.g., Oracle / mySQL / Hadoop /NetEzza).