State of Data #105

Top Read – What Facebook Knows

Great insight into Facebook’s Data Science team, ‘a kind of Bell Labs for the social-networking age. The group has 12 researchers, but is expected to double in size this year. They apply math, programming skills, and social science to mine our data for insights’


Analysis

Using Google BigQuery and QlikView to build snappier dashboards

Big Data

Viewing 100 trillion relationships within Wikipedia through a Big Data lens

Data Science

“What are some must-know tricks in the field of data science that most people are oblivious to?” gets “Data Munging” as the most-voted answer

DBMS

SSD prices are half of what they were a year ago! Amazon, Dropbox, <name your favorite> are using SSDs as primary storage

Idea

Integrate data into Product, or Get Left Behind

“Coors has informationalized the beer can. The mountains turn blue when the beer is cold enough to drink.”

Learning

How to unmask phony online reviews with statistics (Original paper)

 

“Real reviews naturally produce a pattern that looks like the letter J, i.e., it should have a relatively high amount of one-star reviews, fewer twos, threes, and fours, and then a high number of five-star ratings. But phony reviews distort this normal shape”
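
The J-shape described in the quote can be turned into a crude check. The sketch below is only an illustration and not the method from the linked paper: the function names and the rule "more five-star than one-star reviews, and more one-star reviews than any of the middle ratings" are assumptions made for this example.

```python
from collections import Counter


def star_histogram(ratings):
    """Count how many reviews gave each star rating from 1 to 5."""
    counts = Counter(ratings)
    return [counts.get(star, 0) for star in range(1, 6)]


def looks_j_shaped(ratings):
    """Crude heuristic (an assumption, not the paper's test):
    five-star count > one-star count > every middle (2, 3, 4 star) count."""
    one, two, three, four, five = star_histogram(ratings)
    return five > one > max(two, three, four)


if __name__ == "__main__":
    # Hypothetical ratings, made up purely to exercise the functions.
    sample = [1] * 20 + [2] * 5 + [3] * 4 + [4] * 8 + [5] * 60
    print(star_histogram(sample))
    print(looks_j_shaped(sample))
```

A product whose ratings fail this check is not necessarily gamed; the heuristic only flags histograms that depart from the shape the quote calls normal.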

Visualization

International Data Journalism Excellence Awards notable winners –


etc

     


State of Technology #64

At Other Places

Architecture

Is JavaScript ‘enterprise enough’? The big debate and the Pew Pew Manifesto that started it all.

Code

The average web page is 1MB and requires 80 sub-resource requests – how Chrome optimizes this with DNS pre-fetching and TCP pre-connect

“The ‘browser startup experience’ has its own separate cache where Chrome learns the first ten visited URLs across all of your sessions. Whenever you do a fresh boot of your browser, it immediately resolves all of those hostnames – a nice optimization to speed up your morning routine!”

Design

Frog redesigns Android User Experience

 

Essay

The Slow Web

Mobile

Long SMS: the 160-character limit and how to break it
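
As a rough illustration of how long messages are commonly split: a single SMS carries up to 160 characters of the GSM 7-bit alphabet, and a concatenated message carries 153 per part because each part reserves room for a concatenation header. The sketch below assumes every character is a basic GSM 7-bit character that counts once; extended characters and Unicode text are not handled, and the function name is made up for this example.

```python
import math

SINGLE_SMS_LIMIT = 160   # characters in one standalone GSM 7-bit SMS
CONCAT_PART_LIMIT = 153  # characters per part once a concatenation header is added


def sms_parts(text):
    """Return how many SMS parts the text needs.

    Assumption: every character is in the basic GSM 7-bit alphabet,
    so each one counts as exactly one character.
    """
    length = len(text)
    if length <= SINGLE_SMS_LIMIT:
        return 1
    return math.ceil(length / CONCAT_PART_LIMIT)


if __name__ == "__main__":
    for n in (160, 161, 306, 307):
        print(n, sms_parts("a" * n))
```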

SaaS

‘Web API Design – Crafting Interfaces that Developers Love’ – Free eBook

Service

Taxiback ‘offers a service that connects passengers with cabs that would otherwise return empty’

Social

Why do Nigerian scammers say they are from Nigeria? According to a Microsoft researcher, they are not looking for anyone but the most gullible.

Tool

Free Online Graph Paper Generator (ed. favorite: Storyboard paper)

Hack

Make your own standing desk for only $22

etc

Parting Thought

“Most business meetings involve one party elaborately suppressing a wish to shout at the other: ‘just give us the money’.” – Alain de Botton

State of Data #104

Top Read – Casino’s Data Cloud

“Let’s say we want to see the profitability of females fifty-five and older. Who are these ladies? Where do they live? How can we target them better?” The representative showed an animated map of an unidentified city, titled “ground floor, little old ladies, carded play time.” As the clock in the upper left-hand corner spun, the city flared and pulsed with color, registering the home addresses of older women gamblers as they began and ended sessions of machine play on the ground floor of one casino over the course of a day.

Analysis

Why, statistically speaking, bank robbery is a really bad idea


Big Data

Crunch commuter data to find ‘a community’s prosperity is reflected in the comings and goings of its residents’ (original paper)

 

Data Science

Large scale Machine Learning at Twitter (Original Paper)
“There are three main components of a machine learning solution: the data, features extracted from the data, and the model. Accumulated experience over the last decade has shown that in real-world settings, the size of the dataset is the most important factor. Studies have repeatedly shown that simple models trained over enormous quantities of data outperform more sophisticated models trained on less data.”

 

DBMS

A brief, 2-page introduction to Eventual Consistency (pdf)

 

Idea

Learn Pi with Matchsticks

 

Learning

Doing line charts right, with the best representation of Facebook’s IPO day

 

Visualization

Civil War visualized with Python

‘In this view, which my friend Evan Sheehan has termed “ball pit visualization,” each battle would have been represented as a pie chart dropped into a “pit.” Each pit would correspond to who had won the battle.’

etc

 

State of Data #103

#1Read_this_Week

 


#analysis – Data informed vs. data driven – including how Facebook uses data to optimize multiple photo uploads

 


#big_data – How Facebook keeps 100PB of Hadoop data online


#Data_Science – How to drink the right Single Malt

 


#DBMS
Seven databases in a song – ‘PostgreSQL, Riak, HBase, MongoDB, CouchDB, Neo4J and Redis in the style of My Fair Lady’

 

#idea – Information is like food; like food, it should come with the “Percentage of Fact” in it

 

#learning – 10 things that influenced infographics legend Nigel Holmes the most

 

#visualization – Visualization techniques for Time-oriented Data – A goldmine!


#etc

 

State of Technology #62

#at_other_places


#architecture – Mechanics and meaning behind the ol’ dial-up modem sound



#code
The entire source code for Doom 3 has been released


#design
“Can We Please Move Past Apple’s Silly, Faux-Real UIs?”

#essay – The inside story of the death of Palm and webOS

 

#mobile – iPhone fonts should be a tad smaller than iPad fonts. Why? Because one holds an iPhone closer!

#saas – Do hashbangs (#!) belong in URLs?

#service – 20 lines of code that will beat A/B testing

 

#social – A place where deleted tweets go to ... live forever after

 


#tool
Are you getting tired at work? Here is a comparison of the best standing desks available.

#tweaks n’ hacks – Build a complete website on an iPad

 

#etc

 

#parting_thought – “Simplicity is not the absence of clutter, that’s a consequence of simplicity. Simplicity is somehow essentially describing the purpose and place of an object and product. The absence of clutter is just a clutter-free product. That’s not simple.” – Jonathan Ive

 

State of Data #102

#1Read_this_Week – ‘The Girl with the Dragon Tattoo’ makes a SQL mistake – ‘My favourite part, though, was when Lisbeth Salander begins to solve a 40 year old murder cold case using SQL.’

 

#analysis – A good series on ‘Moving Beyond Conversion Rates’ –

Part 1: Avoid Ratios for Metrics
Part 2: Not All Visitors Make Great Customers
Part 3: Visitors Are Not All The Same
Part 4: Campaigns Are Where Conversion Rates Shine

 

#big_data – DataPop’s story – a startup that ‘relies on semantic search and natural-language processing to infer connections between what consumers enter into the search window and what they really want, and then on machine learning to help with everything from determining common spelling mistakes to search construction to the sequence of events that leads to a purchase’

 

#Data_Science – How to predict Eurovision Song Contest winners

 


#DBMS
ROW vs. SET processing of records

 

“1. Set based processing will likely be much faster than row based processing. Our experiment of processing 100K rows showed row based processing was 3700 times slower than set based processing. Not twice as slower or even 10 times slower… 3700 times slower!

2. Compared to set based processing, row based processing times degrade much quicker than set based processing. That is, row based processing does not scale nearly as well as set based processing. We saw this in that the linear trend line for row based processing was 0.00259 compared to 0.00000 for set based processing”
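
To make the two styles concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names, and tax calculation are hypothetical, and an in-memory SQLite database will not reproduce the 3700x figure from the quoted experiment; the point is only to show what "one statement per row" and "one statement for the whole set" look like side by side.

```python
import sqlite3


def make_db(n_rows):
    """Create an in-memory table of hypothetical orders with amounts 1..n_rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, tax REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (id, amount) VALUES (?, ?)",
        [(i, float(i)) for i in range(1, n_rows + 1)],
    )
    conn.commit()
    return conn


def update_row_by_row(conn, rate):
    """Row-based: fetch every row, then issue one UPDATE per row."""
    rows = conn.execute("SELECT id, amount FROM orders").fetchall()
    for row_id, amount in rows:
        conn.execute(
            "UPDATE orders SET tax = ? WHERE id = ?", (amount * rate, row_id)
        )
    conn.commit()


def update_set_based(conn, rate):
    """Set-based: a single UPDATE that the database applies to all rows."""
    conn.execute("UPDATE orders SET tax = amount * ?", (rate,))
    conn.commit()


def total_tax(conn):
    """Sum the tax column so the two approaches can be compared."""
    return conn.execute("SELECT SUM(tax) FROM orders").fetchone()[0]


if __name__ == "__main__":
    row_db, set_db = make_db(1000), make_db(1000)
    update_row_by_row(row_db, 0.1)
    update_set_based(set_db, 0.1)
    print(total_tax(row_db), total_tax(set_db))
```

Both functions leave the table in the same state; the row-based version simply pays for one statement per row, which is the overhead the quoted experiment measures.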

#idea – Nassim Taleb, author of the bestseller ‘The Black Swan’, writes about the ‘side effects’ of too much reliance on data

‘In business and economic decision-making, data causes severe side effects —data is now plentiful thanks to connectivity; and the share of spuriousness in the data increases as one gets more immersed into it. A not well discussed property of data: it is toxic in large quantities —even in moderate quantities.’

#learning –

1. Pros and cons of scatterplots
2. The case for using a density plot instead of a scatterplot

#visualization – “What’s wrong U.S.” is a visualization of Target and Walgreens’ sales data to see which state has the most headaches, allergies, or stuffy noses

#etc 

State of Data #101

#1Read_this_Week – #QuantifiedSelf Body Weight Simulator – Use data-driven models to take charge of your health (hit “Start Simulator”)

 

#analysis – ‘Wake me up when the data is over’. Why you should use images instead of raw data – they cause much less cognitive overload!

#architecture – Big Data Architecture at LinkedIn

 

#big_data – “R” tutorials from universities around the world

 

#Data_Science – A very short history of Data Science



#DBMS
Skeuomorphs, Databases & Mobile Performance



#idea
Big Data for the Rest of Us – “The wine industry has been ruled by small data for the longest time”. How a startup changed the game.



#learning
A plethora of selected data visualization tools that you can even use directly in your web/mobile apps!



#visualization
#ContrastNCompare – This tiny sphere is ALL the World’s Water

#etc