State of Data #83

#analysis – Adventures of DataThief (a useful app that scans graphs and extracts the data points) –
“Most of the past week has been spent convincing myself that it doesn’t really matter how I analyse my data because the results come out the same regardless. This is reassuring for me, but it doesn’t mean that somebody else, looking at my data with fresh eyes and a different perspective, would not come to an entirely different set of conclusions.”


Great Google Insights


#architecture – The Present and Future of Mobile BI

#big_data – To run tests or analysis on realistic names, you can order fake names with country, gender, name set and exclusion rule sets specified.

#Data_Science – Waffles – A Massive Collection of Command-line tools for Machine Learning & Data Mining 

 – A Quick yet Comprehensive Review of Hadoop Ecosystem


#idea – The danger of quick analysis – Did antidepressant prescription rates surge 60% in Britain? The quick reaction would be to assume a significant increase in the depression rate. But what if we probe further and see that pretty much every other drug prescription surged similarly?
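A quick sketch of the baseline check suggested above: compare the drug's growth with the growth of all prescriptions before reading anything into it. Only the 60% figure comes from the item; every other number here is made up for illustration, and `share_change` is a hypothetical helper name.

```python
def share_change(drug_before, drug_after, all_before, all_after):
    """Relative change in a drug's share of all prescriptions."""
    share_before = drug_before / all_before
    share_after = drug_after / all_after
    return share_after / share_before - 1


# Hypothetical counts: the drug grows 60%, all prescriptions grow 50%.
drug_growth = 160 / 100 - 1
overall_growth = 1500 / 1000 - 1
relative = share_change(100, 160, 1000, 1500)

print(f"drug growth:    {drug_growth:.0%}")
print(f"overall growth: {overall_growth:.0%}")
print(f"share change:   {relative:.1%}")
```

Under these invented numbers the headline surge shrinks to a single-digit change in share, which is the whole point of checking the baseline first.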

#learning – Making Data Meaningful – Beautiful series of Data Guidelines from United Nations (pdf) –

  1. Guide to writing stories about Numbers
  2. Guide to Presenting Statistics
  3. Guide to Communicating with Media

#visualization – Social Networks, compared with Countries, Events and Religions




State of Technology #42


#architecture – ‘What is it like to have an understanding of very advanced mathematics’ insights are almost fully portable to ‘understanding of advanced technologies and how they work’

  • Responsive – ‘You can answer many seemingly difficult questions quickly’
  • Confidence – ‘You are often confident that something is true long before you have an airtight proof’
  • Assurance – ‘You are comfortable with feeling like you have no deep understanding of the problem’
  • Structured – ‘Your intuitive thinking about a problem is productive and usefully structured’
  • Analogize – ‘When trying to understand a new thing, you automatically focus on very simple examples’

#code – *****
Remember that awesome demo of ‘Sixth Sense’ (7M views on TED) from Pranav Mistry of MIT a few years ago? Now the code has been open sourced.

#design – 5 Most Insightful Lessons from Interaction Design’s success in 2011


#essay – 9 Common Mistakes in Difficult Conversations

Google now has Android ‘Design Central’ for developers.


#saas – Why most ‘Choose the most difficult password’ policies are mostly bunk and overlook the scariest attacks

#social – *** Secrets behind Apple’s Most famous Icons – e.g., iPod “Artists” logo is a silhouette of Bono.


#tool – 8 of the ‘Best of CES 2012’ Technologies – Gorilla Glass covered!


#tweaks n’ hacks – Server-side JavaScript Injection (pdf) – from Blackhat 2011


#parting_thought – ‘Three orders of magnitude in machine speed and three orders of magnitude in algorithmic speed add up to six orders of magnitude in solving power. A model that might have taken a year to solve 10 years ago can now solve in less than 30 seconds.’ – In pursuit of the Traveling Salesman
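The arithmetic in that parting thought is easy to check: speedups multiply, so their orders of magnitude add. A small sketch, assuming a 365-day year and exact factors of 1,000 (the quote's figures are clearly rounded):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumes a 365-day year

machine_speedup = 10 ** 3      # three orders of magnitude
algorithm_speedup = 10 ** 3    # three more

# Speedups multiply, so the exponents add: 3 + 3 = 6.
combined_speedup = machine_speedup * algorithm_speedup

solve_time_seconds = SECONDS_PER_YEAR / combined_speedup
print(f"combined speedup: {combined_speedup:,}x")
print(f"one year of solving shrinks to about {solve_time_seconds:.1f} seconds")
```

With exact factors of 1,000 the result lands at roughly half a minute, the same ballpark as the quoted 30 seconds.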










State of Data #82

#analysis – The rapidly changing landscape of Mobile Data – a nice aggregate of the latest trend values

#architecture – Zero to Hadoop in 5 minutes

#big_data – Extract from ‘Too big to Know’ – a new book on Big Data and its impact on our brains – ‘designed the Eureqa computer program to find equations that make sense of large quantities of data that have stumped mere humans, including cellular signaling and the effect of cocaine on white blood cells. Eureqa looks for possible equations that explain the relation of some likely pieces of data, and then tweaks and tests those equations to see if the results more accurately fit the data. It keeps iterating until it has an equation that works.’
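A toy sketch of the propose, fit and score loop described in that extract. This is not Eureqa's actual algorithm or API: the candidate list is fixed rather than evolved, `fit_scale` and `search` are made-up names, and the data is synthetic.

```python
import math

# Hypothetical candidate equation shapes, each with one free coefficient a.
CANDIDATES = {
    "y = a*x": lambda x, a: a * x,
    "y = a*x^2": lambda x, a: a * x * x,
    "y = a*sqrt(x)": lambda x, a: a * math.sqrt(x),
}


def fit_scale(shape, xs, ys):
    """Least-squares value of a for y = a*g(x), where g(x) = shape(x, 1)."""
    g = [shape(x, 1.0) for x in xs]
    return sum(gi * yi for gi, yi in zip(g, ys)) / sum(gi * gi for gi in g)


def search(xs, ys):
    """Fit every candidate and keep the one with the smallest squared error."""
    best = None
    for name, shape in CANDIDATES.items():
        a = fit_scale(shape, xs, ys)
        error = sum((shape(x, a) - y) ** 2 for x, y in zip(xs, ys))
        if best is None or error < best[2]:
            best = (name, a, error)
    return best


# Synthetic data generated from y = 3*x^2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x * x for x in xs]
name, a, error = search(xs, ys)
print(name, a, error)
```

A real symbolic regression tool would also mutate and recombine the equation shapes themselves; here only the coefficient is tuned.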


#Data_Science – In Defense of Online Anonymity – Disqus data shows pseudonymous commenters are the best

#DBMS – Why RAID is so important for databases – A Primer

A statistician is building an algorithm to forecast when someone will go back to committing a crime – ‘algorithm that forecasts a particular outcome – someone committing murder, for example – Berk applied a subset of the data to “train” the computer on which qualities are associated with that outcome. “If I could use sun spots or shoe size or the size of the wristband on their wrist, I would,”’

#learning – 40 years of boxplots (pdf) – ‘Boxplots use robust summary statistics that are always located at actual data points, are quickly computable (originally by hand), and have no tuning parameters. They are particularly useful for comparing distributions across groups.’
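For reference, a minimal sketch of the five-number summary a boxplot draws. It assumes Tukey-style hinges (the median is included in both halves when the count is odd), which is one of several quartile conventions; `five_number_summary` is a made-up name and the sample values are invented.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, lower hinge, median, upper hinge, max) for non-empty data."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2            # for odd n, both halves share the median
    lower, upper = data[:half], data[n - half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```

No tuning parameters anywhere, which is the property the quote highlights.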


  • When compassion trumps data – ‘Doctors don’t really have a clue how to predict how long a patient will live.’ Actual paper (PDF) – ‘A patient is eligible for hospice care if they have an estimated life expectancy of six months or less. ... the actual length of stay is usually less than six weeks’
  • ‘Top 1%’ is really mostly about ‘Top 0.1%’ – The growth of the top 1% is mostly sustained by the top 0.1%

State of Data #81

#analysis – 1) Full course text of ‘Advanced Data Analysis’ from University of Michigan, including relevant course work with ‘R’
2) The final online review depends a lot not on the product, but on… early reviews. Later reviewers often react by saying the opposite of the early ones.

‘studied 51,854 reviews contributed to Amazon, covering 858 books from 2000 to early 2004. We found that the order in which reviews are written matters a great deal: Some newly posted reviews tend to disagree with existing reviews, instead of only focusing on the book.’


#architecture – IBM’s Architecture for Astronomical Big Data

‘A main design challenge is how to process one Exabyte of raw data per day. This is the data amount anticipated when the SKA system as the world’s largest and most sensitive radio telescope will be ready; its construction will start in 2016. IBM claims that this data amount exceeds the entire daily Internet traffic. The amount would suffice to fill over 15 million 64 GB iPods.’
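The iPod comparison in that quote checks out, assuming decimal units (1 EB = 10^18 bytes, 1 GB = 10^9 bytes):

```python
EXABYTE = 10 ** 18          # bytes, decimal definition (assumption)
IPOD_64_GB = 64 * 10 ** 9   # bytes

ipods_per_day = EXABYTE // IPOD_64_GB
print(f"{ipods_per_day:,} iPods per day")
```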

#big_data – Can Data Science predict Hit songs? Hey ya! They say you can ‘score’ your own song real soon. Insights –

  • Before the eighties, the danceability of a song was not very relevant to its hit potential. From then on, danceable songs were more likely to become a hit. Also the average danceability of all songs on the charts suddenly increased in the late seventies.
  • In the eighties slower musical styles (tempo 70-89 beats per minute), such as ballads, were more likely to become a hit.

#Data_Science – PageRank algorithm to find the ‘Best Cricket Team’ (pdf) and ‘Best Captains’ in different formats of the game
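A minimal PageRank sketch for the idea above. It assumes each loser "links" to the teams that beat it, which may or may not match how the paper builds its graph; the teams and results are invented, and `pagerank` is a plain power iteration written for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict of node -> list of linked nodes."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = links.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A node with no outgoing links spreads its rank evenly.
                share = damping * rank[n] / len(nodes)
                for m in nodes:
                    new_rank[m] += share
        rank = new_rank
    return rank


# Hypothetical results: C lost to A and B, B lost to A, A never lost.
losses = {"C": ["A", "B"], "B": ["A"], "A": []}
ranks = pagerank(losses)
for team in sorted(ranks, key=ranks.get, reverse=True):
    print(team, round(ranks[team], 3))
```

A win over a highly ranked team is worth more than a win over a weak one, which is what makes this more interesting than a plain win count.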

 – Jonathan Lewis’ ‘Oracle Core: Essential Internals’ has already been dubbed ‘likely be the best Oracle internals book out there for the coming 10 years’ by folks at the top of the trade.

– Next time you go to a doctor for a physical, your data collection may be ‘gamified’ and a whole lot more fun thanks to TonicHealth

#learning – ‘Modeling with Data – Tools and Techniques for Scientific Computing’ – the full book is now available from the author.

‘When I talk to a statistician, a model means a probability distribution over elements, and that’s about it. I’d start talking to a statistician about modeling subject-specific knowledge about the interaction of elements, and giant question marks would appear over his head. Which is not to say that the person is a moron, but just that his understanding of the meaning of the word model is much more narrowly focused than mine.’

#visualization – Visualize CPU Utilization in a Large Data Center – models and approaches


State of Technology #40

 #at_other_places –

  • Mayor Bloomberg learns to code
  • When running on treadmill, do you want to pass by Mayan Ruins, Roman Coliseum or San Francisco sunset? There is an app for that.

#architecture – 
*** ‘Everything I Ever Learned about JVM Performance Tuning @Twitter’ is a brilliant presentation full of very useful tips, with equal focus on young and old generation tuning.

#code – 
How would you write the ‘damn cool algorithm’ of ‘Fountain Codes’?

‘A fountain code is a way to take some data – a file, for example – and transform it into an effectively unlimited number of encoded chunks, such that you can reassemble the original file given any subset of those chunks, as long as you have a little more than the size of the original file.’
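A toy sketch of the idea in that quote. Each encoded chunk is the XOR of a pseudo-random subset of source blocks, and a peeling decoder recovers the blocks once enough chunks have arrived. Assumptions: real fountain codes such as LT codes pick the subset size from a carefully designed distribution, while this sketch picks it uniformly; the chunk carries its block indices directly instead of a seed; all names are made up.

```python
import random


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def encode_chunk(blocks, seed):
    """Return (indices, payload): the XOR of a pseudo-random subset of blocks."""
    rng = random.Random(seed)
    degree = rng.randint(1, len(blocks))
    indices = rng.sample(range(len(blocks)), degree)
    payload = bytes(len(blocks[0]))
    for i in indices:
        payload = xor_bytes(payload, blocks[i])
    return set(indices), payload


def decode(chunks, k):
    """Peeling decoder: resolve chunks that cover exactly one unknown block."""
    known = {}
    progress = True
    while progress and len(known) < k:
        progress = False
        for indices, payload in chunks:
            unknown = [i for i in indices if i not in known]
            if len(unknown) == 1:
                for i in indices:
                    if i in known:
                        payload = xor_bytes(payload, known[i])
                known[unknown[0]] = payload
                progress = True
    return [known[i] for i in range(k)] if len(known) == k else None


message = b"fountain codes!!"
blocks = [message[i:i + 4] for i in range(0, len(message), 4)]

# Keep "receiving" chunks from the fountain until decoding succeeds.
received, seed, recovered = [], 0, None
while recovered is None:
    received.append(encode_chunk(blocks, seed))
    seed += 1
    recovered = decode(received, len(blocks))

print(len(received), "chunks received for", len(blocks), "blocks")
print(b"".join(recovered))
```

The decoder never asks for a specific chunk; it just keeps drinking from the fountain until the blocks fall out.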

#design – 14 of the Best Ideas in Interface design (2011)

#essay – 
IBM 5-in-5 – Five Innovation forecasts that ‘will alter the landscape in 5 years’ –

  1. People power is live
  2. No more need for passwords
  3. Mind reading is doable
  4. ZERO digital divide
  5. Junk mail is the new priority mail

#mobile – Do you know more than 3 out of 10 most popular Mobile websites (for Nov, 2011)? It may surprise you.
#saas – Be careful – QR codes are fast becoming the new malware entry point


#social – Crowdsourcing is in effect to find out which bars serve adulterated drinks in Spain — ‘a site that aims to let consumers call out the ones serving drinks made from adulterated or unauthentic ingredients.’


#tool – 1PasswordPro was voted best iPad Utility App for 2011, and it is rock solid useful

#tweaks n’ hacks – You’ve got Smell – Olly is a Web Connected Smelly Robot that converts notifications into smell.



#parting_thought – On how much individual glorification really matters for big things, and how success, like failure, can often be a ‘naturally emergent’ phenomenon –
“When we hear that a raging forest fire has consumed millions of acres of forest, we don’t assume that there was anything special about the initial spark” – Duncan Watts


State of Data #80

#analysis – 1) How they predict bugs at Google

‘Bug prediction uses machine-learning and statistical analysis to try to guess whether a piece of code is potentially buggy or not, usually within some confidence range. Source-based metrics that could be used for prediction are how many lines of code, how many dependencies are required and whether those dependencies are cyclic’ 


2) You can get your emails analyzed at Tout to view a Dashboard like this
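The source-based metrics named in the bug prediction quote above (lines of code, number of dependencies, cyclic dependencies) are simple to compute. A minimal sketch, not Google's actual tooling: the module names, the dependency map and both function names are hypothetical.

```python
def has_cycle(deps):
    """Depth-first search for a cycle in a dict of module -> dependencies."""
    unvisited, in_progress, done = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = in_progress
        for nxt in deps.get(node, ()):
            s = state.get(nxt, unvisited)
            if s == in_progress or (s == unvisited and visit(nxt)):
                return True
        state[node] = done
        return False

    return any(state.get(n, unvisited) == unvisited and visit(n) for n in deps)


def source_metrics(module, source, deps):
    """Collect the three metrics from the quote for one module."""
    return {
        "lines_of_code": sum(1 for line in source.splitlines() if line.strip()),
        "dependency_count": len(deps.get(module, ())),
        "cyclic_dependencies": has_cycle(deps),
    }


# Hypothetical modules: billing -> orders -> users -> billing is a cycle.
deps = {"billing": ["orders"], "orders": ["users"], "users": ["billing"]}
source = "import orders\n\ntotal = 0\n"
metrics = source_metrics("billing", source, deps)
print(metrics)
```

Numbers like these would then feed whatever statistical model does the actual guessing.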

#architecture – Hadoop for Archiving Email – Part 1; Part 2

#big_data – Is Big Data ready for prime time?

‘Soon, a drug saleswoman will have real-time analytics that tell her to focus on the doctors who spent time on social networks that morning, and who are thus more apt to influence colleagues’

HITS Algorithm to Measure Human Pecking Order

‘focus on the way that interlocutors copy each other’s use of certain types of words in sentences. In particular, they look at functional words that provide a grammatical framework for sentences but lack much meaning in themselves (the bold words in this sentence, for example)’


#DBMS – Facebook shares some more tips on scaling MySQL

#idea – On how to find more than 10 trillion digits of Pi, and how to write about it when you do

#learning – ‘Stupid Data Miner Tricks: Overfitting the S&P500’ (PDF)

‘The dark side of data mining is to pick and choose from a large set of data to try to explain a small one. Evil data miners often specialized in “explaining” financial data, especially the US stock market. Here’s a nice example: we often hear that the results of the Superbowl in January will predict whether the stock market will go up or down for that year. If the NFL wins, the market goes up, otherwise, it takes a dive.’

#visualization – Crowd-sourced shared vision of Future of Technology visualized – all the way to Year 3036 when we expect a ‘Telepathic Society’. And, BTW, most people think ‘cash will be outlawed’ in 2062.