State of Data #72

#analysis The most comprehensive guide to Mobile Statistics ever

A trip to the darker side: Analyzing Hackers – what they talk about the most vs. real-life threats, and how precious data is traded on the ‘black market’ (PDF)

Peter Norvig on how AI is revolutionizing search

Google Voice Search relies on 230 billion real world search queries to learn all the different ways that people articulate given words. So people no longer need to train their speech recognition for their own voice, as Google has enough real world examples to make that step unnecessary.

Finally, the words are strung together into a language model, which tells you which words are most likely to come after another word. There might be a soundwave that sounds like either “city” or “silly”, but if it follows the words “New York…” then the language model would tell us that “city” is more likely. 
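The “city” vs. “silly” choice can be sketched as a tiny bigram model. This is only an illustration: the corpus is invented for the example and the function names are hypothetical, while a production speech system would use far more data and context.

```python
from collections import Counter

# Toy corpus, invented purely for illustration (not real query data).
CORPUS = [
    "new york city is busy",
    "new york city never sleeps",
    "that is a silly idea",
    "do not be silly",
    "york city council met today",
]

def bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def pick_candidate(previous_word, candidates, counts):
    """Return the candidate seen most often after previous_word."""
    return max(candidates, key=lambda word: counts[(previous_word, word)])

COUNTS = bigram_counts(CORPUS)
best = pick_candidate("york", ["city", "silly"], COUNTS)
print(best)
```

In this toy corpus “city” follows “york” three times and “silly” never does, so the model prefers “city” even if the soundwave is ambiguous.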

#big_data McKinsey asks ‘Are you ready for the era of Big Data’ with five questions (Jeopardy think music begins here) –

1. What happens in a world of radical transparency, with data widely available?
2. If you could test all of your decisions, how would that change the way you compete?
3. How would your business change if you used big data for widespread, real-time customization?
4. How can big data augment or even replace management?
5. Could you create a new business model based on data?

#conference AI Challenge 2011 – “watch your ant colony fight for domination against colonies created by other people from around the world”

#Data_Science –  
What happens when your business is “Statistics as a Platform”?

Want a database to play dice and pick just a random plan (NOT the best plan) for a query? There’s a hidden parameter for that.

#idea 3 Experts offer ideas on ‘Competing through Data’. Insights – 

· Most great revolutions in science are preceded by revolutions in measurement

· A one-standard-deviation increase toward data and analytics was correlated with about a 5 to 6 percent improvement in productivity and a slightly larger increase in profitability

· I can have all the data I want to have – but I still have to communicate it to our players.

‘Signs that you’re a Bad Programmer’. Best insight –

Inability to think in sets

Transitioning from imperative programming to functional and declarative programming will immediately require you to think about operating on sets of data as your primitive, not scalar values.


Funny enough, visualizing a card dealer cutting a deck of cards and interleaving the two stacks together by flipping through them with his thumbs can jolt the mind into thinking about sets and how you can operate on them in bulk.
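The card-dealer image translates almost directly into code. The sketch below is one possible reading, not something taken from the quoted article: the function name `riffle` is made up, and the deck is just a list of numbers. The point is that the cut and the interleave are each a single bulk operation on a collection, with no card-by-card index bookkeeping.

```python
from itertools import chain

def riffle(deck):
    """Cut the deck in half and interleave the halves in bulk."""
    half = len(deck) // 2
    top, bottom = deck[:half], deck[half:]
    # zip pairs up cards from each half; chain flattens the pairs.
    interleaved = list(chain.from_iterable(zip(top, bottom)))
    # With an odd-sized deck the bottom half holds one extra card.
    return interleaved + bottom[len(top):]

deck = list(range(1, 9))
shuffled = riffle(deck)
print(shuffled)
```

For an eight-card deck this yields 1, 5, 2, 6, 3, 7, 4, 8: the two stacks dealt alternately, as in the dealer’s riffle.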

Want a handy baseball correlation ellipse or image scatter plot (like how flu spreads) to communicate data? R Graph Gallery is there to help.


State of Technology #30


· Opera congratulates Amazon on ‘finally catching up’ with Cloud-based browser

· How Apple Ads are different from others – they never show blank screens, even in photos of their devices; and why they chose a 3.5” screen for the iPhone 4S (hint: think of your thumb)

#architecture Learning from (others’) mistakes – How Etsy overcame poor Architectural choices

Why taking time out from regular programming activities, like a 10% side project, is important – a co-founder of O’Reilly publishing walks through. “Anything that is dismissed on the grounds of the technology-not-being-good-enough-yet is going to happen” – Ben Hammersley

#design Great index card set for bringing psychology to Web Design; first seven cards are free

#essay The Great Eight: Trillion-dollar Growth Trends to 2020 – Bain looks at the evolving Macro-trends. Here’s an immediate insight – The next billion consumers are not “another billion.”

#mobile Practically all Design Patterns for iPad and iPhones – Bar, Button, Carousel, Chat, Curl to good old Tables

#saas Secure Coding Guidelines for Web Applications from Mozilla #recommend

#social – Very insightful Lecture Summary (quoted in #code section above) to a group of CyberSecurity practitioners –

We assume that every meal we eat, every hotel bed we sleep in, every piece of culture we consume, is something we can have an opinion on, and have it be given the same importance as an opinion from anyone else. There are rating sites online for you to rate just about anything, legal or not, and the sheer weight of amateur reviews outdoes the professionals for authority most of the time. 


#tool Does one have to scroll left/right/up/down to view the major content of your site on an iPad 2? What about in Chrome on a netbook? ‘Where’s the fold’ measures it in a jiffy.


#tweaks n’ hacks The Sketchbook project could be a time sink – “offering to scan people’s sketchbooks they’ve motivated a community of artists from all over the world”



#parting_thought “When you’re 1% there, you’re almost done. Just seven doublings away” – Twitter Verse
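The arithmetic behind the quote is easy to check. A minimal sketch (the function name is ours, not from the tweet): start at 1 percent and keep doubling until you reach or pass 100 percent.

```python
def doublings_needed(start, target):
    """Count how many times start must double to reach or pass target."""
    count = 0
    value = start
    while value < target:
        value *= 2
        count += 1
    return count

steps = doublings_needed(1, 100)
print(steps, 2 ** steps)
```

Six doublings only get you to 64 percent; the seventh lands on 128, past the finish line.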





State of Data #71

#analysis How to (or not to) predict Nobel Prize winners, year after year

Is cycling unhealthier than driving? Statisticians argue.

Comparison between Terracotta and Memcache (disclaimer: from Terracotta, but mostly in line)

Could teaching MapReduce be good for beginning undergraduates? 

The basic issue is that Google’s narrow MapReduce API conflates logical semantics (define a function over all items in a collection) with an expensive physical implementation (utilize a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication. But there’s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization.
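A single-process sketch can make the distinction concrete. It is not the MapReduce API and not the author’s code, just two hypothetical functions: one waits until every value for a key has been collected before reducing (the barrier), the other folds each value into a running total as it arrives. For a sum the two agree, because addition is associative and commutative; the complaint in the quote is that the API gives no way to declare that a Reduce has this property.

```python
from collections import defaultdict

def reduce_with_barrier(pairs):
    """Barrier style: collect every value per key first, then reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # nothing is reduced until all pairs are in
    return {key: sum(values) for key, values in groups.items()}

def reduce_streaming(pairs):
    """Streaming style: fold each value into a running total on arrival."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]
barrier_result = reduce_with_barrier(pairs)
streaming_result = reduce_streaming(pairs)
print(barrier_result == streaming_result)
```

A reduce such as a median would not have this property, since it genuinely needs all values for a key before it can answer.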

At PASS 2011, Microsoft announces a partnership with Hortonworks to integrate with Hadoop

#Data_Science – Pandas is a new Data Analysis toolkit in Python (on that excuse, a moment of cuteness)


#Finally We got a Bayesian Nobel Recipient.

#DBMS Do you secure the data or the software? Security expert Pete Finnigan postulates it is always about securing data (PDF), in a lecture at, of all places, Bletchley Park.

Why ‘form factor’ of data should evolve into ‘of courseness’

The old players, much like the old Facebook, presented information in a way the filesystems or databases viewed it. Apple refactored this data back into a familiar interface; one more similar to the ways humans interacted with the same language prior to its digital translation.

Must-read new book on Safari – ‘Big Data Glossary’ – where else could we learn about Hypertable, MapR, OpenNLP, Fusion Tables and BSON all in the same place!

#visualization Do word clouds add genuine value, or are they merely “mullets of the Internet”?


  • Could someone predict what TV channel you’re watching from your smart meter data? This paper (originally in German) shows how they did it.
  • Which TV shows is everyone talking about? And which TV shows are Rihanna fans or Diet Coke drinkers talking about? Social Data/TV Leaderboard reveals it all!
  • Data + Viz + Social Awareness + Gamification – Living within means with
  • How data is helping babies (and their parents!) to sleep better

State of Data #70

#analysis – Gamers’ Search Pattern

#architecture – Google offers MySQL on Cloud

#big_data – Hotmail deployed SSD (Solid State Disk) to manage frequently-accessed metadata (read/unread status etc) of 1 petabyte of emails a week.
“One technology that is promising in this area is Flash Storage (also called SSD, or Solid State Drive). SSDs use technology similar to what you’d find on an SD card or USB stick, but with a faster internal chipset and a much longer lifespan. A normal hard drive can perform a little more than one hundred read/write operations per second, whereas some of the fastest SSDs can do over one hundred thousand operations per second.”

#conference –
All Oracle OpenWorld / JavaOne 2011 sessions are online now

#Data_Science –   
Benford’s Law and Decreasing Reliability of Accounting Data
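For readers who have not met it: Benford’s Law says that in many naturally occurring sets of numbers the leading digit d appears with probability log10(1 + 1/d), so about 30 percent of values start with 1. A minimal sketch of the kind of check the linked analysis is about, assuming integer inputs; the helper names are hypothetical and the sample data is just powers of two, not accounting figures.

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(n):
    """First digit of a nonzero integer (sign ignored)."""
    return int(str(abs(n))[0])

def observed_shares(values):
    """Observed share of each leading digit among nonzero integers."""
    digits = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(digits.values())
    return {d: digits[d] / total for d in range(1, 10)}

# Stand-in data: powers of two are known to follow Benford's Law closely.
SAMPLE = [2 ** k for k in range(100)]
EXPECTED = benford_expected()
OBSERVED = observed_shares(SAMPLE)

for d in range(1, 10):
    print(d, round(EXPECTED[d], 3), round(OBSERVED[d], 3))
```

Large gaps between the observed and expected columns are what auditors treat as a flag for a closer look, not as proof of anything.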

 – a) good, impartial overview of Oracle noSQL database (with dev team members commenting on bottom)

b) technical white paper from the vendor (PDF)

Like many NoSQL databases, the Oracle NoSQL Database is configurable to be either C/P or A/P in CAP.
Unlike Dynamo, SimpleDB, Cassandra, or Riak, the Oracle NoSQL Database does not support eventual consistency.
Like most NoSQL systems, the Oracle NoSQL Database does not support joins.


#idea – Monkeys free your data from Major Social Networks and sends a JSON dump to your email box

#learning – 
For people who present data to influence others, this 2-minute video can change your outlook. See how a simple change in color lets the brain process in 250 ms cycles rather than 10-second cycles.





#visualization – World’s Most Expensive Cities’ Real Estate Data Visualized and compared to US




State of Data #69

#analysis Which telecoms store your mobile data the longest?

#architecture New Hadoop-based Canonical Data Ecosystem in a nutshell –

top level for focused on analytics, while companies such as Cloudera, EMC Greenplum and MapR operate on the lower level with their Hadoop distributions that focus on cluster management and performance

#big_data Oracle on NoSQL ‘hype’ (PDF)

#conference (Good, Inexpensive, Local) 5th XLDB Conference & Workshop; Oct 18-20, SLAC, Menlo Park
“XLDB stands for eXtremely Large DataBases. The lead organizer, Jacek Becla, seems to have started XLDB because he has 100 petabytes of astronomical data to plan for” (Hat tip: Curt Monash)

Moneyball 2? Great analysis of Basketball shooting strategies – The problem of shot selection in basketball: “The shooter’s sequence” (PDF)

“Inspired by these recent discussions, in this paper I construct a simple model of the “shoot or pass up the shot” decision and solve for the optimal probability of shooting at each shot opportunity”

First issue of PostgreSQL Magazine is now out


#idea Is your Social Network bigger than the number of folks who worked at Bletchley Park? How does it compare with the number of people saved from the Titanic? Window shopping for numbers in

Memo from Instagram co-founder on how they handle sharding and unique IDs, using Django and PostgreSQL. They store 25 photos every second. (Hat tip: Vik Patil)
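To give a flavour of what such an ID scheme can look like, here is a sketch of a time-sortable 64-bit ID that embeds the shard. The bit layout (41 bits of milliseconds, 13 bits of shard ID, 10 bits of per-shard sequence) is an assumption based on our recollection of the memo, not something stated in this digest; the epoch value and function names are hypothetical, and the real system generates IDs inside PostgreSQL rather than in Python.

```python
TIME_BITS, SHARD_BITS, SEQ_BITS = 41, 13, 10  # assumed layout, 64 bits total
EPOCH_MS = 1_314_220_021_721  # hypothetical custom epoch, in Unix milliseconds

def make_id(now_ms, shard_id, sequence):
    """Pack time, shard and sequence into one sortable integer."""
    elapsed = now_ms - EPOCH_MS
    if not 0 <= elapsed < (1 << TIME_BITS):
        raise ValueError("timestamp outside the 41-bit range")
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id outside the 13-bit range")
    seq = sequence % (1 << SEQ_BITS)  # wrap the per-shard counter
    return (elapsed << (SHARD_BITS + SEQ_BITS)) | (shard_id << SEQ_BITS) | seq

def split_id(value):
    """Recover (timestamp_ms, shard_id, sequence) from an ID."""
    seq = value & ((1 << SEQ_BITS) - 1)
    shard_id = (value >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    elapsed = value >> (SHARD_BITS + SEQ_BITS)
    return elapsed + EPOCH_MS, shard_id, seq

now = EPOCH_MS + 1_000
photo_id = make_id(now, shard_id=1341, sequence=5001)
print(photo_id, split_id(photo_id))
```

Because the timestamp sits in the high bits, IDs sort by creation time, and the shard can be read back from the ID without a lookup table. Under this assumed layout each shard could mint 1,024 IDs per millisecond, comfortable headroom over 25 photos a second.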


#visualization Big Data Opportunities across industry segments