Double Edition: State of Data #48, #49 (Break for the next 3 weeks)
May 13, 2011
#analysis – Felix Salmon brilliantly analyzes “Grouponomics”: if you do not buy wine, the restaurant loses. “diners paid $15 for their Groupon – which gave them $30 of food... So even after knocking $22.50 off the bill (remember that Giorgio’s kept $7.50 of the proceeds of Groupon), the restaurant would often still make money”
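To make the arithmetic in that quote concrete, here is a minimal sketch. The $30 voucher and the $7.50 that reaches the restaurant come from the quote; the 30% cost ratio, the two bill sizes, and the function name `restaurant_profit` are assumptions made up for illustration, not figures from Salmon's piece.

```python
def restaurant_profit(bill_total, voucher_value=30.00,
                      restaurant_share=7.50, cost_ratio=0.30):
    """Profit on one Groupon table.

    voucher_value and restaurant_share follow the quoted example;
    cost_ratio (cost of goods as a fraction of the menu price) is a
    hypothetical number chosen only to illustrate the effect.
    """
    # The diner pays cash only for whatever exceeds the voucher.
    cash_from_diner = max(bill_total - voucher_value, 0.0)
    # The restaurant also receives its share of the Groupon proceeds.
    revenue = cash_from_diner + restaurant_share
    # Cost is incurred on everything served, discounted or not.
    cost = bill_total * cost_ratio
    return revenue - cost


if __name__ == "__main__":
    for bill in (30.00, 60.00):  # food only vs. food plus wine (assumed bills)
        print(f"bill ${bill:.2f}: profit ${restaurant_profit(bill):.2f}")
```

Under these assumed costs, a table that orders only the $30 of food leaves the restaurant slightly in the red, while a $60 bill (say, with wine) is comfortably profitable.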
Design of Large Scale Log Analysis (PDF) from Microsoft – if you ever need to dig into web server logs or behavioral logs, or want to see ‘what logs cannot tell us’, this is a good resource to consult. This editor’s favorite analysis fallacy (Simpson’s Paradox) is mentioned as well.
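For anyone who has not met Simpson’s Paradox, a short sketch may help. The numbers below are the widely reproduced kidney-stone treatment figures used in textbooks, not anything taken from the Microsoft PDF; the names `data`, `rate` and `overall` are just for illustration.

```python
# (successes, patients) for each treatment, split by stone size.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, patients):
    """Success rate for a single group."""
    return successes / patients


def overall(groups):
    """Success rate after pooling all groups of one treatment."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return successes / patients


if __name__ == "__main__":
    for size in ("small", "large"):
        a = rate(*data["A"][size])
        b = rate(*data["B"][size])
        print(f"{size} stones: A={a:.1%}  B={b:.1%}")
    print(f"pooled: A={overall(data['A']):.1%}  B={overall(data['B']):.1%}")
```

Treatment A wins within each stone size, yet B looks better once the groups are pooled, because A was given mostly the harder (large stone) cases. Aggregated log metrics can mislead in exactly the same way when the segments differ in size and difficulty.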
#architecture – Why Guardian chose MongoDB
One of the most scalable, performant and challenging “integration” problems ever, solved within dated infrastructure: “The Incredible delivery system of India’s Dabbawallahs”. There are SO MANY patterns to learn from it about (data) movement as well.
#big_data – Here is to the huge potential hidden within Google Maps Directions Logs – “massive logs of people asking for directions from A to B,… And, it appears this data may be as or more useful than user reviews of businesses and maybe GPS trails for local search ranking, recommending nearby places, and perhaps local and personalized deals and advertising”
The paper referenced above is a good read too: “at least 20% of web queries have local intent”, and “time-aware scoring”, i.e. how the results one gets back depend on whether the search for ‘beer’ was made at 10AM on a Monday or at 10PM on a Friday.
#DBMS – How StackOverflow made pages 100x faster by… SQL tuning
Speaking of tuning, the full “Oracle Performance Tuning” course (on video) is now available on Safari
#learning – Two good tools for text analysis — Word Frequency Lists and SentiWordNet
“Presentation on Drizzle by Brian Aker who led MySQL until Oracle acquired Sun. Interesting observations on not only database but best practices and prevalent approaches in the industry (replication, virtualization, etc.)”
#visualization – “How Quick Can We Be – Current Data Visualization Techniques for Front-end Engineers” – shows some neat tricks with OpenHeatMap, Fusion Tables and Google Charts — slide-deck from JS Conf 2011 (full conference slides available here)
How to solve problems with Visual Analytics (PDF; 25 MB) – free ebook from VisMaster, a European consortium for data visualization
#etc
- Whoa! Worldwide, 27M enterprise servers process 10 zettabytes a year (and this figure is from 2008); that is roughly 64TB per company. The beautiful “How much information” (PDF) report was published last month
- And why it will go on and on.. Because of aggregations like this – 37,000 photos of the “universe” combined together to create a magnificent 5 gigapixel snapshot
- Self-congratulating? Data Visualization on Data Scientists
- Morphology of online ‘re-targeting’ – how ads keep chasing people
- Visualizing 38M deaths as World Cuisine
- Big Data, Little Problem – 3 months of HD video in ONE second – 109 Tb/s (terabits per second) over-the-wire speed achieved
- R “R” Us – A good new way to learn R
- Data & Serial Killers – Bill James applies his data science