State of Data #116

Reddit’s Database Has Only Two Tables

“they use two tables for each “thing”, so a thing/data pair for accounts, a thing/data pair for links, etc.”
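
To make the thing/data idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not Reddit's actual schema: the table names (thing_link, data_link), the column names, and the helper functions are all hypothetical, chosen only to illustrate one narrow "thing" table paired with one key/value "data" table per kind of object.

```python
import sqlite3


def create_thing_pair(conn, kind):
    """Create one thing/data table pair for a kind of object (hypothetical layout)."""
    # kind is interpolated into table names, so only allow plain identifiers.
    if not kind.isidentifier():
        raise ValueError(f"invalid kind: {kind!r}")
    conn.execute(
        f"CREATE TABLE thing_{kind} ("
        "id INTEGER PRIMARY KEY, "
        "created TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.execute(
        f"CREATE TABLE data_{kind} ("
        f"thing_id INTEGER NOT NULL REFERENCES thing_{kind}(id), "
        "key TEXT NOT NULL, "
        "value TEXT, "
        "PRIMARY KEY (thing_id, key))"
    )


def add_thing(conn, kind, attrs):
    """Insert a thing row, then one data row per attribute; return the new id."""
    cur = conn.execute(f"INSERT INTO thing_{kind} DEFAULT VALUES")
    thing_id = cur.lastrowid
    conn.executemany(
        f"INSERT INTO data_{kind} (thing_id, key, value) VALUES (?, ?, ?)",
        [(thing_id, key, str(value)) for key, value in attrs.items()],
    )
    return thing_id


def get_thing(conn, kind, thing_id):
    """Reassemble a thing's attributes from its key/value rows."""
    rows = conn.execute(
        f"SELECT key, value FROM data_{kind} WHERE thing_id = ?",
        (thing_id,),
    ).fetchall()
    return dict(rows)
```

In this sketch, adding a new attribute to links needs no schema change, only a new key in data_link; the cost is that every value is stored as text and reading one object means collecting several rows.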


Gregor Mendel’s Suspicious Data

“He [Mendel] was most anxious to have his results replicated and expanded, for even self-possessed people (and he wasn’t) entertain occasional misgivings about the accuracy, originality, and significance of their work.

To achieve these goals, his work had to be understood. In comparison to his theories, of whose validity he was sure, the data were of no significance whatsoever.”

Big Data

Cool Algorithms: How to Estimate the Cardinality of a Large Dataset

Data Science

Ads and the City: Considering Geographic Distance in Recommendations (pdf)

“… human mobility, we learn two insights: 1) there are special individuals who visit many places; and 2) individuals go to a venue not only because they like it but also because they are closeby.

We model these insights into two simple models and learn that: 1) simply recommending power users works better than random but is far from producing the best recommendations; 2) an item-based recommender system produces accurate recommendations; and 3) recommending places that are closest to a user’s geographic center of interest produces recommendations that are as accurate as item-based recommender’s”
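
The third finding, recommending the places closest to a user's geographic center of interest, is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not the paper's method: the center is taken to be the plain average of the coordinates the user has visited (reasonable only at city scale), distance is the great-circle (haversine) distance, and the function names and the venues mapping are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def center_of_interest(visits):
    """Assumed definition: the mean latitude and longitude of visited places."""
    lats = [lat for lat, _ in visits]
    lons = [lon for _, lon in visits]
    return (sum(lats) / len(lats), sum(lons) / len(lons))


def recommend_nearest(visits, venues, k=3):
    """Return up to k unvisited venue names, nearest to the user's center first."""
    center = center_of_interest(visits)
    visited = set(visits)
    candidates = [(name, loc) for name, loc in venues.items()
                  if loc not in visited]
    candidates.sort(key=lambda item: haversine_km(center, item[1]))
    return [name for name, _ in candidates[:k]]
```

A call such as recommend_nearest(visits, venues, k=2) takes a list of (lat, lon) tuples and a dict mapping venue names to (lat, lon) tuples. Note that this baseline uses no preference data at all, which is what makes the accuracy reported in the quote notable.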


Tom Kyte hands over the ‘keys to Oracle’ – and it is free


HBR’s ‘Big Data’ Insight Center


How Google builds Maps and provides Directions

“I came away convinced that the geographic data Google has assembled is not likely to be matched by any other company. The secret to this success isn’t, as you might expect, Google’s facility with data, but rather its willingness to commit humans to combining and cleaning data about the physical world.”


Some Great Data Visualization Tutorials from FlowingData


