State of Data #116
September 14, 2012
Reddit’s Database Has Only Two Tables
“they use two tables for each ‘thing’, so a thing/data pair for accounts, a thing/data pair for links, etc.”
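To make the thing/data idea concrete, here is a minimal sketch using SQLite. The quote only says that each kind of object gets a thing/data pair of tables; the column names (`id`, `created`, `thing_id`, `key`, `value`) and the helper functions are assumptions made for illustration, not Reddit's actual schema.

```python
import sqlite3


def make_thing_tables(conn, kind):
    """Create a hypothetical <kind>_thing / <kind>_data table pair."""
    # The kind is interpolated into table names, so accept plain identifiers only.
    if not kind.isidentifier():
        raise ValueError(f"invalid kind: {kind!r}")
    conn.execute(
        f"CREATE TABLE {kind}_thing (id INTEGER PRIMARY KEY, created TEXT)"
    )
    conn.execute(
        f"CREATE TABLE {kind}_data ("
        f" thing_id INTEGER REFERENCES {kind}_thing(id),"
        f" key TEXT, value TEXT,"
        f" PRIMARY KEY (thing_id, key))"
    )


def save_thing(conn, kind, created, attrs):
    """Insert one row into the thing table and one data row per attribute."""
    cur = conn.execute(
        f"INSERT INTO {kind}_thing (created) VALUES (?)", (created,)
    )
    thing_id = cur.lastrowid
    conn.executemany(
        f"INSERT INTO {kind}_data (thing_id, key, value) VALUES (?, ?, ?)",
        [(thing_id, k, str(v)) for k, v in attrs.items()],
    )
    return thing_id


def load_thing(conn, kind, thing_id):
    """Reassemble the key/value rows of one thing into a dict."""
    rows = conn.execute(
        f"SELECT key, value FROM {kind}_data WHERE thing_id = ?", (thing_id,)
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
for kind in ("account", "link"):
    make_thing_tables(conn, kind)
link_id = save_thing(
    conn, "link", "2012-09-14", {"title": "Hello", "url": "http://example.com"}
)
print(load_thing(conn, "link", link_id))
```

In this sketch, adding a new attribute means writing another key/value row rather than altering a table, which is the usual appeal of such a layout; the cost is that every value is stored as text and reads need reassembly.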
“He [Mendel] was most anxious to have his results replicated and expanded, for even self-possessed people (and he wasn’t) entertain occasional misgivings about the accuracy, originality, and significance of their work.
To achieve these goals, his work had to be understood. In comparison to his theories, of whose validity he was sure, the data were of no significance whatsoever.”
Cool Algorithms: How to Estimate Cardinality of Large Datasets
Ads and the City: Considering Geographic Distance in Recommendations (pdf)
“..in human mobility, we learn two insights: 1) there are special individuals who visit many places; and 2) individuals go to a venue not only because they like it but also because they are closeby.
We model these insights into two simple models and learn that: 1) simply recommending power users works better than random but is far from producing the best recommendations; 2) an item-based recommender system produces accurate recommendations; and 3) recommending places that are closest to a user’s geographic center of interest produces recommendations that are as accurate as item-based recommender’s”
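The third finding, recommending the places closest to a user's geographic center of interest, can be sketched roughly as follows. The quote does not say how the paper defines that center, so this sketch assumes it is the plain mean of the coordinates of the venues the user has visited; the function names and the sample data are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def center_of_interest(visits):
    """Assumed definition: the mean latitude and longitude of visited venues."""
    lats = [lat for lat, _ in visits]
    lons = [lon for _, lon in visits]
    return (sum(lats) / len(lats), sum(lons) / len(lons))


def recommend_nearest(visits, venues, k=3):
    """Return the names of the k unvisited venues closest to the center."""
    center = center_of_interest(visits)
    visited = set(visits)
    candidates = {
        name: loc for name, loc in venues.items() if loc not in visited
    }
    ranked = sorted(candidates, key=lambda n: haversine_km(center, candidates[n]))
    return ranked[:k]


visits = [(40.0, -74.0), (40.2, -74.0)]
venues = {
    "near": (40.1, -74.0),
    "far": (41.0, -74.0),
    "seen": (40.0, -74.0),  # already visited, so it is skipped
}
print(recommend_nearest(visits, venues, k=2))
```

Averaging raw coordinates is only reasonable when a user's visits cluster in one city-sized area; it misbehaves for visits that straddle the antimeridian, so treat it as a simplification.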
Tom Kyte hands over the ‘keys to Oracle’ – and it is free
“I came away convinced that the geographic data Google has assembled is not likely to be matched by any other company. The secret to this success isn’t, as you might expect, Google’s facility with data, but rather its willingness to commit humans to combining and cleaning data about the physical world.”