State of Data This Year – 2011

The last issue of the year revisits the 48 newsletters published in 2011. Happy Holidays!

#2011_Books_on_data_and_else


  1. Checklist Manifesto – What works in the ER should work in your project. Goes along with the aphorism ‘What can be measured can be managed’. Very handy for complex projects such as moving to a new data center plus a tech stack refresh.
  2. *** Thinking, Fast and Slow – We float in data but decide by gut. The same data in a different representation can change the decision: e.g., ‘a vaccine would save 90% of those affected’ vs. ‘it would kill 10%’ may produce different emotional responses even though the two statements are logically equivalent. Book of the year, perhaps of the decade, by a Nobel-winning behavioral psychologist.
  3. Information – ‘The universe has so far done 10^120 “ops” to create roughly 10^90 bits of data’ is just one of many things one will learn.
  4. Think Stats – One of the best refreshers for wannabe Data Scientists.
  5. Race Against the Machine – Why ‘Big Data’ problems could all just go away on their own, and ‘optimization’ may become a lost art. Machines are taking over. Scary, thoughtful, and full of wonderful chessboard analogies. Best $2.99 to spend this holiday season!


  1. Moneyball – ‘Analytical, sabermetric approach to Baseball’, and entertaining.
  2. Joy of Stats – From the most entertaining statistician ever, perhaps.
  3. Sort Algorithms explained by Hungarian Folk Dancers
  4. Brilliant Machine Learning Demos
  5. How Negative reviews could improve sales


  1. FREE Classes from Stanford – Data & Others (go to bottom to see full list)
  2. VideoLectures – ‘selection in machine learning is particularly noteworthy. Look for the title “For those about to Machine Learn” halfway down the page. It takes you to the “Machine Learning Summer School”, a collection of tutorials by the leading authorities in the game’
  3. MIT OpenCourseWare – ‘In particular, look at this
  4. ‘Finally, MIT announced a new initiative that will begin Spring 2012 that provides students with self-paced instruction, laboratories, and an opportunity to earn a certificate of mastery.’
  5. Data Trivia Hunting – Quiz #1, #2, #3 from ‘Significance’

#2011_data_misc (compilation from past issues of 2011)

State of Technology #38

#Happy_Holidays – After the break, newsletter will continue ~bi-weekly in 2012
#at_other_places –

#architecture – Netflix Platform Architecture
#code – Understanding Java Garbage Collection – a 90-minute talk, but you can zip through the first half

#design – Invisible side of Design  – ‘we tend to get distracted by aesthetics of our designs, and often do not pay enough attention to the other, invisible side of our creations’

#essay – 
When can you expect a quantum computer on your desk? An MIT scientist responds.

‘… of the most-celebrated quantum computations to date has been to factor 15 into 3 times 5 – with high statistical confidence!’

#mobile – 
Brief rant on the Future of Interaction Design – why hands are our future

#saas – Renault opens up ‘Car as Platform’

#social – How ‘evil Crowdturfing’ is taking over social media – Detection of Hidden Paid Posters (PDF)

“They discovered that paid posters tend to post more new comments than replies to other comments. They also post more often with 50 per cent of them posting every 2.5 minutes on average.”


#tool –  2011 Ultimate Developer & Power Users Tool List (for Windows), and here’s how you can run IE6 on Windows 7 (say, to triage a customer issue) –

‘… version of Windows Virtual PC lets you run Windows XP applications next to your Windows 7 apps for the ultimate in backward-compatibility. I like Virtual XP for when I want to run XP apps like IE6 seamlessly alongside my Windows 7 apps.’


#tweaks n’ hacks – Spot a fake Rolex – by NFC (Near Field Communication)



#parting_thought – “Know that the Internet has no eraser” – Liz Strauss







State of Data #78

#analysis – “Data Informed, Not Data Driven” – How analytics plays a critical role in Facebook design decisions.

Why photo upload “success rate” is so important on Sundays (people do ‘important’ stuff on weekends and upload 150% more pics on Sundays);

‘..very difficult for a set of metrics to fully represent what you value’

#architecture – How to build a data mining web app –

“Well, here is a source code that deploys your app with one command on Google App Engine. You just need to focus on where to get the data (ETL), what to do with it (DM), and how to display it (VISUALIZATION). The source code has example that you can swap with an idea of your own.”

– How 2012 will be for Big Data – Predictions –

a) Technology – 5 Big Data Predictions for 2012 – Streaming Data Processing

b) Volume – From IDC – “Big Data will earn its place as the next ‘must have’ competency in 2012 as the volume of digital content grows to 2.7 zettabytes (ZB), up 48% from 2011”

c) Effectiveness – From Gartner – “Through 2015, more than 85 percent of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage.”

#Data_Science – Analyzing email to find close friends

‘..took all the e-mail data from an international firm and for one of its offices asked employees to list the people in their social network, dividing the list into friends, colleagues, and acquaintances.

Then Uzzi and Wuchty scanned the workers’ e-mails, for each recording sender and receiver, the time it was sent, and the time it took for the receiver to respond.

The researchers found that both methods—the volume threshold and the response criterion—did a fair job of approximating the social networks the employees had reported themselves.

But then Uzzi and Wuchty tried something new. Instead of looking at the absolute values for volume and response, they looked at the response time.. The new method predicted who was in different employees’ social networks with an accuracy that is several percent higher than the other methods’

#DBMS – Cloud Storage Benchmark (PDF)


#idea – Management by Statistics

“Lots of folks play fast and loose with statistics to make political points. If I told you the United States has lost most of its manufacturing jobs, is that a problem? What if I told you the United States manufactures the most in the world, but manages to do so with the fewest number of people? (Much like how the U.S. produces the most agricultural goods, but uses very few people to do so) Would you still think that is a problem? You could argue this either way, of course, but the point is that the same observable reality can be presented in various ways, thereby slanting the story.”


#learning – How to ‘cook’ with Data – “Same data, same map, different stories”

“As you can see, for each definition of class limits you get a different message. Most people just use equal intervals, but that’s lazy, IMHO. Using equal intervals in a choropleth map is like sorting a bar chart alphabetically. The only thing that is worse than equal intervals is equal intervals plus round numbers.”
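To make the point about class limits concrete, here is a minimal Python sketch that bins the same values two ways: equal intervals and quantiles. The data and the function names (`equal_interval_breaks`, `quantile_breaks`, `classify`) are made up for illustration and are not taken from the linked article.

```python
from bisect import bisect_right
from collections import Counter

def equal_interval_breaks(values, k):
    """Class boundaries that split the data range into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]

def quantile_breaks(values, k):
    """Class boundaries that put roughly the same number of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // k] for i in range(1, k)]

def classify(values, breaks):
    """Count values per class; a value equal to a boundary goes to the higher class."""
    counts = Counter(bisect_right(breaks, v) for v in values)
    return [counts.get(i, 0) for i in range(len(breaks) + 1)]

# Hypothetical, right-skewed values (say, some rate per region).
values = [1, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 100]

equal_counts = classify(values, equal_interval_breaks(values, 4))
quantile_counts = classify(values, quantile_breaks(values, 4))
print("equal intervals:", equal_counts)
print("quantiles:      ", quantile_counts)
```

With these made-up numbers, equal intervals put 11 of the 12 values into a single class, while quantiles give four classes of three values each, so the same data would paint two very different maps.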


#visualization – What the world searched for in 2011 – Casey Anthony to Fukushima



State of Data #77

#analysis – Where is Microsoft’s growth going to come from?

#architecture – ‘Why some execs think Hadoop ain’t all that…’

– Is 42 still the answer to everything? Amazon’s cloud is the world’s 42nd fastest supercomputer.


#conference – NPR series on Big Data – Part 1 – Big Data & Data Scientists; Part 2 – Demand for Skills

– How can you tell how much ‘Photoshop-ness’ a picture has? Could legislation someday require this metric to be posted for all ‘touched up’ photos? Data science can help measure the ‘degree of photo retouching’ (PDF paper)

‘Thanks to the magic of digital retouching, impossibly thin, tall, and wrinkle-free models routinely grace advertisements and magazine covers with the legitimate goal of selling a product to consumers. On the other hand, an overwhelming body of literature has established a link between idealized and unattainable images of physical beauty and serious health and body image issues for men, women, and children.’

#DBMS – NoSQL 101 – A 45-minute Introduction for Beginners


#idea – Machines could apparently be happily creating even more data, if only pesky humans had the smarts to understand it

#learning – Anscombe’s Quartet is just one reason why graphing can ‘change’ the ‘face of data’ – ‘four datasets with identical statistical properties, but appear very different when graphed’
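The ‘identical statistical properties’ claim is easy to check. The sketch below hard-codes the four datasets as published by Anscombe (1973) and computes the usual summary statistics with the Python standard library; the helper `pearson` and the `summary` layout are just illustrative names.

```python
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# (mean x, variance x, mean y, variance y, correlation) per dataset
summary = {
    name: (mean(xs), variance(xs), mean(ys), variance(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, vx, my, vy, r) in summary.items():
    print(f"{name:>3}: mean_x={mx:.2f} var_x={vx:.2f} "
          f"mean_y={my:.2f} var_y={vy:.2f} r={r:.3f}")
```

The printed rows agree to about two decimal places, which is exactly why a scatter plot of each dataset is needed to see how different they really are.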

– Just in time – Gift Card Data Visualization – ‘it is worth less than you think’



  • Graphic Explanation of Bayes Theorem

  • ‘Correlation is not Causation’ lesson from Dilbert-Wally

  • Statisticians could go to jail for telling the truth; a memorable line: “Unfortunately, in Greece statistics is a combat sport”

  • How to predict when a car will run a red light


State of Technology #36

#at_other_places –

#architecture – 
Known Scalable Architecture Templates

“Gossip and Nature-inspired Architectures – This model follows the idea of gossip in normal life, and the idea is that each node randomly picks and exchanges information with fellow nodes. Just like in real life, Gossip algorithms spread the information surprisingly fast. Another aspect of this are Biology inspired algorithms. Natural world has remarkable algorithm for coordination and scale. For example, Ants, Folks, Bees etc., are capable of coordinating in scalable manner with minimal communication. These algorithms borrow ideas from such occurrences. The paper ‘From epidemics to distributed computing’ discusses the models.”
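How fast is ‘surprisingly fast’? A toy simulation gives a feel for it. This is a minimal sketch of push-style gossip under simplifying assumptions (synchronous rounds, every informed node contacts one uniformly random node per round, possibly itself, and no messages are lost); the function name `gossip_rounds` and the seed are illustrative, not from the linked article or paper.

```python
import random

def gossip_rounds(n_nodes, seed=0):
    """Rounds of push gossip needed until all n_nodes have heard the rumor."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    informed = {0}             # node 0 starts the rumor
    rounds = 0
    while len(informed) < n_nodes:
        # Every informed node tells one node chosen uniformly at random.
        targets = {rng.randrange(n_nodes) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds

for n in (10, 100, 1000, 10000):
    print(f"{n:>6} nodes: {gossip_rounds(n)} rounds")
```

Because the informed set can at most double each round, at least log2(n) rounds are needed; in this toy model the observed number of rounds grows roughly logarithmically with the number of nodes rather than linearly.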

#code – 
Build a Chrome Extension in < 60 min

#design – HTML5 Showcase for Developers – The WOW and the HOW

#essay – Clayton Christensen mentions Turbotax and Quickbooks (‘where jobs are very stable over long periods, not vulnerable to product life cycles’; 58’) on ‘Reinventing IT’ – 5 themes on how ‘little boys can beat big giants’ —

  1. Disruption
  2. Compete against Non-consumption
  3. Supply chain disruption
  4. Target job, not the customer
  5. Catch the tide of decommoditization

#mobile – Why CTR (Click-Through Rate) is so high on Mobile – Fat finger fumbling – “Ohh yes… when you are pinch-zooming and do it a bit sloppy.. or when scrolling and accidentally tap instead…”
#saas – You only control 1/3rd of your Page Load Performance – “You cannot rely on big 3rd party providers to always deliver high performance. You should be aware of the problems that can occur if you put Third Party Content on your page and you really have to take action.”

#social – 
Netflix’s VP of Engineering gives a detailed analysis of the 5-star rating system

“However, your users will rate less content the more options your rating system provides. You will get more ratings if you give your users fewer choices, so a 10 option system turns out to be worse than a 5 option system. Netflix uses a 5 star rating system. At a couple of points in the product history, Netflix made available or tested half stars.”


#tool – Worst Offenders – ‘Tracking the Trackers’ – things that slow down the web the most, from one of the best browser tools, ‘Ghostery’

#tweaks n’ hacks
 – An Illustrated Guide to Cryptographic Hashes


  • Email vs. Humanity : 0 – 2: Don Knuth scores the first; Atos scores the second.


#parting_thought – ‘Everyone has an invisible sign hanging from their neck saying, “Make me feel important.” Never forget this message when working with people’ – Mary Kay






State of Data #76

#analysis – Analytics – The Widening Divide (MIT-IBM Study) –

  • Aspirational companies, just thinking about analytics
  • Experienced companies, with some solid progress on analytics
  • Transformed companies, those with advanced capabilities and significant results

#architecture – Realtime Data Mining at 120,000 tweets per second –

“Data is extracted out of the firehose and normalized. Twitter data is highly dimensional, it has 30 plus attributes and you get access to them all. These attributes include geolocation, name, profile data, the tweet itself, timestamp, number of followers, number of retweets, verified user status, client type, etc.”

“All males would be 50% of the firehose or 125 million tweets as there’s a 50-50 male female split on Twitter. Creating a filter for all tweets made by males would not be bright. It would be very expensive. What you want to do is look at use cases. Are you a bank, are you a pharma, and figure out what you are interested in specifically.”


 – Drowning in Data (video; 2’21”) – the ‘bandwidth’ of Tweet creation is 46MBPS; a big source of big data is now wireless sensors; do we need the biggest unit of measurement (yotta: a 1 followed by 24 zeros)?


#conference – Most interesting papers from InfoVis conference 2011

#Data_Science – 
‘Teaching Statistics’ (PDF) from Andrew Gelman is a must quick-view. E.g., why are the counties with the highest kidney cancer rates mostly in the center-west? Then how do you explain that the center-west region also has the lowest kidney cancer death rates?

#DBMS – A Two-Year Case Study on NoSQL

#idea – Data is the new .com?

“..mere presence of data is not itself an indicator of having deep and relevant data DNA. “Hey, our business generates a lot of data, BIG DATA” is a phrase I hear frequently which I assume is supposed to get me excited. It doesn’t. “Hey, we’ve got this thesis that as our business scales we’re going to build a monster data asset that can better help us attract, retain and monetize happy customers. It will help us create competitive barriers and we’re planning for this from Day 1. We’ve shared some early data with a data-hacker buddy and feel this is a promising avenue for building company value.” Hey now, NOW you’ve got my attention.”

#learning – 
How to do heat-map-like ‘Color Scales’ in Excel

#visualization – Bloomberg is DATA – the “About” page of a company transformation


  • Music of Math – Google engineer Alexander Chen created a ‘playable visualization of the famous first prelude from Bach’s cello suites’. “Using the mathematics behind string length and pitch,” Chen explains, “it came from a simple idea: what if all the notes were drawn as strings?”

  • Spot data ‘overfitting’ in analysis

  • #censusData How did recession impact state-to-state migration?

  • Wealth of Data != Data of Wealth (Hat tip: Siddharth Ram)