State of Data #68

#analysis – How negative reviews could increase sales online (the full 30-minute video from the recently concluded Strata Jumpstart can be watched here)

One reason is that buyers gain confidence that “if this is the worst this product will throw at me, it must be pretty good.”

#architecture – Analyzing web server logs with LogParser, with charting option #ReallyUseful


#big_data – Data flowing through mobile devices now outpaces broadband traffic, and 8 other very intriguing data points

#Data_Science – Why and how to hire a data scientist – Cathy O’Neil explains: ‘the skills of a data scientist include not only crunching numbers, but also visualizing the results’, with real-life examples.

#DBMS – Mauve should work with NoSQL too #Time2Rewrite



#idea – Should ‘Best Practices’ be renamed ‘Default Practices’? James Morle, a key storage expert, ruminates.

‘It’s time to stop using Best Practice as a substitute for good old-fashioned thinking, and start to implement designs based upon Right Practice: The ‘Right’ Practice for your particular situation.’

#learning – How to account for ‘leap seconds’ across thousands of servers in a massively distributed environment


#visualization – Making a ‘Motion Chart’ was never easier – with the googleVis R package and a code example



  • Data Park – IBM launches Parking Analytics in San Francisco – “30 percent of city congestion stems from motorists looking for parking spaces”
  • Facebook stores data including who “poked” you ever, every message you deleted, your religious views and about 800 pages full of other stuff. Many European users can ask Facebook for a CD with their full data. And they’re doing it. #Privacy
  • Death Curve #DataViz (Hat tip: Dylan Lewis)
  • Twitter data shows anger/happiness/sorrow/frustration transcends cultural boundaries – ‘mood’ depends on ‘circadian cycle’


State of Technology #26

#at_other_places –

#architecture – What is the web but a giant IfThisThenThat workflow engine? Scott Hanselman explains.

#code – Super-fast tutorial on REST

#design – Must Read – Smashing Magazine publishes free design eBook on its fifth anniversary. 

#essay – Steven Levy writes a long piece on Facebook’s strategy – Facebook is a timeline-annotated autobiography of a user.

….will gather and organize the massive amount of data generated by the apps you use to tell your story, minute by minute, day by day, year by year

#mobile – Especially targeted for Mobile – Stanford’s full course on Human-Computer Interaction Design; Series of 43 Video Sessions 

#saas – 37Signals homepage evolution – Time Lapse

#social – Fewer than 5% of users ever change default settings

#tool – BuckyBalls are now BuckyCubes


#tweaks n’ hacks – Behind Intel’s new hardware-based random number generator


  • Moore’s Law or Koomey’s Law – “energy efficiency of computing doubles every 1.5 years”


#parting_thought – “There’s nobody to blame but everybody” – David Ortiz (Red Sox)



State of Data #67

#analysis – Three Secrets of Business Analytics (from 37Signals)

 How Lloyd’s of London uses R for Insurance

#architecture – How and why a Portland startup went from Postgres to MongoDB and came back (PDF)

This might make some people cringe. Mongo has a single global read/write lock for the entire server. The effect this has is that if a write ever takes a non-trivial amount of time—page fault combined with slow disk, perhaps—everything backs up. We had high lock % when disk %util was only ~30-40%


#big_data – Convert .csv file to MySQL Database

Yelp has opened up reviews for 7,000 businesses and is calling on talented data miners from universities to solve problems – e.g., “Top 10 Positive and Negative words ranked”

#Data_Science –   Building Data Science Teams


All the top data scientists share an innate sense of curiosity. Their curiosity is broad, and extends well beyond their day-to-day activities. They are interested in understanding many different areas of the company, business, industry, and technology. As a result, they are often able to bring disparate areas together in a novel way….I’ve seen data scientists apply novel DNA sequencing techniques to find patterns of fraud.

#DBMS – Is database design a dying art, or a dead art already? (interesting comments too)


 #idea – Story behind Opera’s $84M big data funding

#learning – 
“Is the average number of fair coin tosses required to get a HTH (Head-Tails-Head) pattern greater than, less than, or the same as, the number of tosses required to get a HTT pattern?” Peter Donnelly (TED talk) shows how stats fool juries
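The question can be checked numerically. The sketch below is not taken from the talk: it computes the expected waiting time from the pattern's self-overlaps (a standard result for a fair coin) and cross-checks it with a seeded simulation. The function names `expected_tosses` and `simulate_tosses` are illustrative.

```python
import random


def expected_tosses(pattern: str) -> int:
    """Expected number of fair-coin tosses until `pattern` first appears.

    For a fair coin this equals the sum of 2**k over every length k at
    which the pattern's first k symbols equal its last k symbols.
    """
    n = len(pattern)
    return sum(2 ** k for k in range(1, n + 1) if pattern[:k] == pattern[n - k:])


def simulate_tosses(pattern: str, trials: int = 20000, seed: int = 0) -> float:
    """Estimate the same quantity by tossing a simulated fair coin."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    n = len(pattern)
    total = 0
    for _ in range(trials):
        window = ""  # the most recent n tosses
        count = 0
        while window != pattern:
            window = (window + rng.choice("HT"))[-n:]
            count += 1
        total += count
    return total / trials


if __name__ == "__main__":
    for pattern in ("HTH", "HTT"):
        print(pattern, expected_tosses(pattern), round(simulate_tosses(pattern), 2))
```

Under these assumptions HTH takes 10 tosses on average and HTT takes 8. The intuition: after seeing HT, a miss on HTT (an H) still leaves you with the first symbol of the pattern in hand, while a miss on HTH (a T) sends you back to the start.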


#visualization – Meta-visualization – what are the most popular types


State of Data #66

#analysis – How data-driven design decisions will power improvements in Windows Explorer

#architecture – Hadoop, Big Data and Enterprise Data Warehouse – A Quick State of Technology Orientation


#big_data – A) How to process a million songs in 20 minutes; B) Petasort in under 33 minutes – using MapReduce


#Data_Science – Google published an amazing “Green Dashboard” to measure the environmental impact of things like watching 30 minutes of video on YouTube

#DBMS – Database architecture of, and – Oracle databases doing 550M transactions/day


#idea – How CFOs feel about ‘Big Data’ – “With Big Data,” Tieman says, “you may be spot-on about a problem, but the solution doesn’t magically appear out of the data.”

#learning – “Visualize This: The FlowingData Guide to Design, Visualization, and Statistics” is now available in Safari


#visualization – How to find the epicenter of an earthquake from tweets – “shows tweets spreading across the country in the 80 seconds immediately after the earthquake hit; the rate of twitter activity is color-coded from red (most intense) to blue (middling intense) to green (least intense)” (Note: opens slowly, but worth it)



State of Technology #24


·   Linux is now hosted on github 

·   Social Application – Ticketmaster lets you sit with your Facebook friends #Like

·   Jig is a good play to tap social network for small projects

·   RIP Michael S. Hart – Founder of “Project Gutenberg”

#architecture – Zachman Framework Version 3 is now released


#code – ‘Clever Algorithms: Nature-inspired Programming Recipes’ – want to build your next game on Ant Colony System? Free book to read; can be purchased on Amazon too

#design – Good design is as little design as possible – Dieter Rams’ ten principles of “Good Design”


Good design is innovative
Good design makes a product useful
Good design is aesthetic
Good design helps us to understand a product 
Good design is unobtrusive 
Good design is honest 
Good design is long-lasting 
Good design is consequent to the last detail
Good design is concerned with the environment 
Good design is as little design as possible 


#essay – ‘The Secret Guild of Silicon Valley’ is made up mostly of C++ programmers?

They aren’t interested in tweeting, blogging, or giving talks at conferences. They care about building and shipping code. They’re more likely to be found in IRC chat rooms, filing JIRAs for Apache projects, or spinning out Github repos in their spare time.

#mobile – Anatomy of an HTML5 app


#tool – Company scans books you send, sends you the PDF, and destroys the book after a year. All for $1

#tweaks n’ hacks – Stick Figure guide to AES





#parting_thought (uttered >20 years ago) – “Everything that today goes through wires will go through the air, and everything that goes through the air today will go through wires.” – Nick Negroponte

State of Data #65

#analysis – Draw the correlation curve, *then* see what trend your line maps to.

A great series on A/B testing from 37signals – Part 1, Part 2, Part 3 – which concludes: “Big photos of smiling customers work”


#architecture – Analytic Data Management at Zynga (5 TB/day) and LinkedIn – Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay’s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) About half the data is in each part, but I don’t think that’s by deliberate choice.

#big_data – Forrester defines ‘Big Data’ – ‘techniques and technologies that make handling data at extreme scale economical’


#Data_Science – How facial recognition can uncover the first 5 digits of your SSN

#DBMS – Neat trick if you want to quickly gain performance on a long-running batch job (e.g., ETL) – reduce the number of commits with just a parameter.
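The parameter from the linked tip is not reproduced here. As a rough illustration of the underlying idea (fewer commits per batch job), here is a minimal sketch that batches commits by hand using Python's built-in sqlite3 module. The table name `staging`, the function `load_rows` and the `batch_size` argument are all hypothetical.

```python
import sqlite3


def load_rows(conn: sqlite3.Connection, rows, batch_size: int = 1000) -> int:
    """Insert rows, committing once per batch instead of once per row.

    Returns the number of commits issued, so the saving is visible.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS staging (id INTEGER, payload TEXT)")
    commits = 0
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO staging (id, payload) VALUES (?, ?)", row)
        pending += 1
        if pending == batch_size:
            conn.commit()  # one commit covers the whole batch
            commits += 1
            pending = 0
    if pending:
        conn.commit()  # flush the final partial batch
        commits += 1
    return commits


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    data = [(i, "row %d" % i) for i in range(2500)]
    print(load_rows(connection, data, batch_size=1000), "commits for", len(data), "rows")
```

The trade-off is the usual one: larger batches mean fewer commits, but more work is lost and redone if the job fails partway through a batch.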

#idea – ‘Shower of Data’ from Seth Godin – “new generation, one that grew up with a data surplus, is coming along…what always happens when something goes from scarce to surplus. First we bathe in it, then we waste it.”

‘Big Data Now’ from O’Reilly is now available free on Amazon Kindle

#visualization – Mapping email closing lines



  • ‘The Theory That Would Not Die’ – a history of Bayes’ Theorem – ‘Alan Turing used it to decode the German Enigma cipher; U.S. Navy to search for a missing H-bomb; to assess the likelihood of a nuclear accident; and .. used to verify the authorship of the Federalist Papers’

  • Half-life of a link – “The mean half life of a link on twitter is 2.8 hours, on facebook it’s 3.2 hours and via ‘direct’ sources (like email or IM clients) it’s 3.4 hours. So you can expect, on average, an extra 24 minutes of attention if you post on facebook than if you post on twitter”

  • Alternative Leading Indicators – Big Mac index is so 2010.
    ‘A reader from the pharmaceutical industry recommends tracking suppositories. “Financial worries and austerity changes in diet cause intestinal disorders,” he says, and sales of suppositories therefore rise as the economy goes down the pan.’
  • Which ‘p’ is which in statistics – “You just get used to it and figure out which p is which from context. It reminds me of George Foreman naming all five of his sons George.”

State of Data #64

#analysis – It’s official: the #1 reason (68%) customers leave a company is that they believe the ‘company does not care about them’, a study by Rockefeller Corporation suggests.

The man with the Golden Crystal Ball – ‘In May 2010 he predicted that Egypt’s president, Hosni Mubarak, would fall from power within a year. Nine months later Mr Mubarak fled Cairo amid massive street protests.’

#architecture – What language is ‘R’ written in? Mostly C; the rest is FORTRAN and R.

#big_data – ‘Data Wrangling for fun and profit’ is a growing ‘collection of tips and tricks for data work’.


Remember, the website is the API.

 – “Scaling up Machine Learning, The Tutorial” – Ron Bekkerman, Sr Research Scientist, LinkedIn – presents at KDD 2011

#DBMS – Another good NCOUG presentation (Aug 18; PDF) on ‘Data Masking – How to address Development and QA teams’ 7 most common data masking related reactions and concerns’

#idea – “Swimming in Data” – This USB-powered underwater sensor (e.g., for an aquarium) will detect temperature, pH and ammonia levels and send an alert to your smartphone.

#learning – Awk is often very effective for managing data quality, especially with semi-structured (e.g., log) data. Following up on his highly popular blog series, Peteris Krumins releases ‘Famous Awk One-Liners Explained’ as a book ($5.95). Example –

Find the line containing the largest (numeric) first field.

awk '$1 > max { max=$1; maxline=$0 }; END { print max, maxline }'
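For readers who don't use awk, the same logic can be sketched in Python. This is an illustration rather than anything from the book, and the function name `max_first_field` is made up. It tracks the largest first field seen so far along with the line that carried it. One deliberate difference: awk's `max` effectively starts at zero, so the one-liner finds nothing when every first field is negative, whereas this version starts from "no value yet".

```python
def max_first_field(lines):
    """Return (largest numeric first field, the line containing it).

    Lines that are empty or whose first field is not a number are skipped.
    Returns (None, None) if no line qualifies.
    """
    best_value = None
    best_line = None
    for line in lines:
        fields = line.split()  # whitespace-separated, like awk's default
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # first field is not numeric
        if best_value is None or value > best_value:
            best_value = value
            best_line = line.rstrip("\n")
    return best_value, best_line


if __name__ == "__main__":
    sample = ["3 apples", "10 pears", "7 plums"]
    print(max_first_field(sample))
```

To run it over a file, pass the open file object directly, e.g. `max_first_field(open("access.log"))`, where the file name is just a placeholder.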

#visualization – Visualizing Jane Austen’s leading character appearances in her novels