July 27, 2009
After a round of restructuring, an engineer asked me, “So how can I keep my job in the next one?” I do not think anyone can answer that precisely. But one can prepare to stay relevant in one’s domain over a reasonable time window, say 3 to 6 years. Then, even if micro-factors turn evil (say, the project one works on gets scrapped), one has a good shot at finding the next job.
In my area, data, I have a laundry list of things that I believe will shape our careers, their growth and their rewards in the near term. I strongly believe “data” is a great domain to be in. Hal Varian, Google’s Chief Economist, thinks so too.
“I keep saying the sexy job in the next ten years will be statisticians.
People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids.
Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.”
With that in mind, let’s look at my list of things that data technologists or leaders have to embrace to solve the next levels of the Data game. I will cover TEN key topics in two posts.
(1) Map / Reduce –
Even though Michael Stonebraker seriously doubts that it offers much over parallel processing in a traditional RDBMS, I personally think Map Reduce is here to stay, at least for unstructured data. At JavaOne 2009, the eHarmony folks described how they used the technology to offload to Amazon S3 jobs that took days in a traditional RDBMS and get results back in single-digit hours.
I would personally not declare Hadoop a clear winner, but do understand Map Reduce: learn how to write effective Java map functions that deal with TBs of data on just about any of the Hadoop-like platforms out there.
Homework –
Download the airplane bird-strike statistics from the FAA. Write functions (preferably in Java, but you can also use PL/SQL or even C++) to map, partition, compare and reduce. Try to find the seasonal variability of reptile strikes in Florida versus that of bird strikes in other states. Extra point for uploading the result set to a “cloud” data store.
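To make the map, partition and reduce steps concrete, here is a minimal single-machine sketch in Python (the homework itself suggests Java). The records, the group labels and the function names are all made up for illustration. They are not the FAA file format, and this is not how Hadoop itself is programmed.

```python
from collections import defaultdict

# Hypothetical strike records: (state, month, animal group). Not real FAA data.
RECORDS = [
    ("FL", 1, "reptile"),
    ("FL", 7, "reptile"),
    ("FL", 7, "reptile"),
    ("FL", 7, "bird"),
    ("TX", 1, "bird"),
    ("TX", 7, "bird"),
    ("NY", 7, "bird"),
]


def map_record(record):
    """Map step: emit ((series, month), 1) for the records we want to compare."""
    state, month, group = record
    if group == "reptile" and state == "FL":
        yield ("FL reptile", month), 1
    elif group == "bird" and state != "FL":
        yield ("non-FL bird", month), 1


def partition(pairs):
    """Partition step: collect all values emitted under the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def strikes_by_month(records):
    """Run map, partition and reduce in sequence over the records."""
    pairs = (pair for record in records for pair in map_record(record))
    return reduce_counts(partition(pairs))


if __name__ == "__main__":
    for key, count in sorted(strikes_by_month(RECORDS).items()):
        print(key, count)
```

On a real cluster the map and reduce functions run on many machines and the framework does the partitioning, but the shape of the code stays the same.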
(2) “R” –
Data-R-Us. Download R. It even works on Windows. If you have ever used a scripting language and had fun with it, R will double that fun with data. Statistical analysis, or converting data into graphics, is a key skill and requirement.
Homework –
Download the “State median income by family size” data. Using R, plot the US-wide median income for 3-person vs 4-person families in an x-y plot.
(3) Data Compression –
People trade over $2000 worth of goods on eBay every second. Facebook creates 15 TB of data every single day. At that scale, using left-over CPU to asynchronously “compress” the data makes perfect sense. Typically, the more advanced algorithms can compress data 2X to 5X.
Data compression also makes data retrieval more efficient, since fewer bytes have to be read from disk.
- Facebook compresses the data stored in Hive using gzip over Hadoop SequenceFiles. Try to get an understanding of Hive. Speaking of Hadoop, it looks like the Yahoo distribution of Hadoop will eventually win the race to integrate ANSI-standard SQL queries.
- If you are using a DBMS, try Oracle Advanced Compression. It’s pricey, but it adds minimal overhead when compressing data synchronously for atomic transactions.
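To get a feel for compression ratios, here is a small sketch using gzip from Python's standard library. The sample log lines and the helper name `compression_ratio` are invented for illustration. Such repetitive text compresses far better than typical production data, so treat whatever number it prints as an upper-end curiosity, not a benchmark.

```python
import gzip

# Made-up, highly repetitive log lines standing in for real event data.
SAMPLE = "".join(
    f"2009-07-27 12:00:{i % 60:02d} user={i % 50} action=view page=/item/{i % 200}\n"
    for i in range(5000)
).encode("utf-8")


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Return original size divided by gzip-compressed size."""
    compressed = gzip.compress(raw, compresslevel=level)
    # Sanity check: decompressing must give back exactly the original bytes.
    if gzip.decompress(compressed) != raw:
        raise ValueError("round trip failed")
    return len(raw) / len(compressed)


if __name__ == "__main__":
    print(f"{len(SAMPLE)} bytes, ratio {compression_ratio(SAMPLE):.1f}x")
```

Try it on a sample of your own data at a few different levels (1 to 9) to see how CPU time trades against size.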
(4) “New” SQL –
SQL, surprisingly, is about as old as C: both date from the early 1970s. Even though people debate the future of the DBMS, SQL is here to stay. Its advantages are its simplicity (it reads like plain English) and the millions of people who can already write a line in it. In fact, even the next-generation data stores try to implement at least a “SQL-like query language”.
Sadly, people in the programmer community generally do not realize how much SQL has evolved, especially in the last 8 years. For example, “Analytic Functions” can replace 1000 lines of code that roll up data over different windows with 4 lines of “Analytic SQL”. XML structures can be built very fast within the database, using either a Java or a C based wrapper, to prevent a huge middleware load. The CASE ... WHEN expression, part of the SQL standard since SQL-92, lets one replace IF-THEN-ELSE logic with a single line of SQL. Regular-expression searches over a large data set are at least 4X faster when done in SQL.
SQL is the most used programming language of all time, if we count the number of people who have ever written a line in it. Learn the following; all known databases and even some new “key-value stores” have them:
- CASE statement (IF-THEN-ELSE logic within SQL)
- MERGE statement (UPSERT: UPDATE the row if it exists, otherwise INSERT it)
- Use Analytic Functions to paginate in a web application or find out the “Top 15 Percentile of Companies per total annual invoice receipts”
- Statistical Functions in SQL – like DENSE_RANK; CROSSTAB (produces PIVOT tables); NORMAL_RAND (even Postgres has them)
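As a rough illustration of two items from this list, the sketch below runs a CASE expression and the DENSE_RANK analytic function against an in-memory SQLite database from Python. The `invoices` table, its rows and the 1000 threshold are invented for the example. It assumes the SQLite bundled with your Python is version 3.25 or later, which is when SQLite gained window functions.

```python
import sqlite3


def rank_companies():
    """Total invoices per company, label them with CASE, rank with DENSE_RANK."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (company TEXT, amount REAL)")
    # Hypothetical invoice rows, for illustration only.
    conn.executemany(
        "INSERT INTO invoices VALUES (?, ?)",
        [
            ("Acme", 500.0),
            ("Acme", 700.0),
            ("Bolt", 300.0),
            ("Bolt", 400.0),
            ("Core", 1200.0),
        ],
    )
    rows = conn.execute(
        """
        SELECT company,
               total,
               CASE WHEN total >= 1000 THEN 'large' ELSE 'small' END AS size,
               DENSE_RANK() OVER (ORDER BY total DESC) AS rnk
        FROM (SELECT company, SUM(amount) AS total
              FROM invoices
              GROUP BY company)
        ORDER BY rnk, company
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in rank_companies():
        print(row)
```

Companies with equal totals share a rank, and DENSE_RANK leaves no gap after the tie. Doing the same in application code would mean a sort plus hand-written tie handling.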
(5) Data Visualization –
The ability to “let data talk” through images, maps, charts and very innovative non-verbal representations will be one of the most marketable skills in the face of the rapid, almost unmanageable rate of data growth.
- Finish the “Data Visualization” course from Harvard University. It will take 20 to 40 hours, depending on whether you also do the homework.
- Edward Tufte's seminars are a bit more expensive than the online course, but if you are the kind of person who learns more in a “group”, those are very good too! To save money, you can also borrow the books and notes from someone who has already attended.
To be continued.