State of Data #109

Top Read

“What’s the phone number of the office where this picture was snapped from?” (There is NO information other than this photo)

How a Google Scientist cracked it using search, data and common sense.

Analysis

Soccer is adopting “Moneyball” (aka statistical techniques) too

 

Big Data

Colleges awakening to the Opportunities of Data Mining

“The new breed of software can predict how well students will do before they even set foot in the classroom. It recommends courses, Netflix-style, based on students’ academic records.”

Data Science

Using Twitter to Identify Psychopaths

‘Klout, which rates Twitter users’ influence, gave rise to Klouchebag, which rates their, um, “klouchiness.” This study could theoretically give rise to a “Klychopath” rating’

DBMS

SQL vs. NoSQL – analysis on a programming level

 

Idea

How to Visualize Data, say, U.S. High School Drop-out rates – if you have no pen, paper, computer or software, just the whole National Mall to yourself

Learning

10 Reasons Why We Visualize Data

Visualization

Visualizing a day on NASDAQ –  Part 1,  Part 2

 

etc

State of Technology #68

At Other Places

Architecture

Scaling Lessons at Dropbox – from 4,000 to 40,000,000 users

Code

How to make https sites REALLY secure

‘Another mitigation is to use scheme-relative URLs everywhere possible. These URLs look like //example.com/bar.js and are valid in all browsers. They inherit the scheme of the parent page, which will be HTTP or HTTPS as needed.

Fundamentally this is such an easy mistake to make, and such a problem, that the only long term solution is for browsers to stop loading insecure Javascript, CSS and plugins for HTTPS sites. To their credit, IE9 does this and did it before Chrome.’
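To make the scheme-relative behaviour in the quote concrete, here is a minimal Python sketch. It uses the standard library's urllib.parse.urljoin, which resolves a reference against a base URL. The parent page URLs (shop.example) are made-up placeholders; only //example.com/bar.js comes from the quote.

```python
from urllib.parse import urljoin

# Scheme-relative reference taken from the quote above.
SCRIPT_REF = "//example.com/bar.js"


def resolve(page_url: str, ref: str = SCRIPT_REF) -> str:
    """Resolve a reference against the page that embeds it."""
    return urljoin(page_url, ref)


# Hypothetical parent pages, one per scheme.
https_result = resolve("https://shop.example/checkout")
http_result = resolve("http://shop.example/checkout")

print(https_result)
print(http_result)
```

The same reference ends up on HTTPS when the parent page is HTTPS and on HTTP otherwise, which is why it avoids mixing insecure scripts into a secure page.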

Design

Weather App based on Dieter Rams’s 10 Principles of Design

Essay

‘On Mastery’ – How a Subway Sushi bar with ten seats could charge $600 a meal and win Three Michelin Stars

Mobile

How to Retinafy your Website – Flowchart

SaaS

Evolution of E-Commerce Checkout

Service

2012 Worst Security SNAFUs, so far

Social

Now Facebook Code Scans for Sexual Predators

Tool

For 237 years there has been no big innovation in the Flush Toilet. The stagnation is taking its toll.

Hack

Lego Turing Machine that could compute

etc

Parting Thought

‘A software architect is the code janitor. Happily sweeping up after the big party is over’  – Chris Epstein on ‘A Software Architect’

State of Data #108

Top Read

EU Cookie/Privacy Laws: Implications for Data Collection & Analysis

 

Analysis

6 Numbers from Facebook Insights that Matter

 

5. Lifetime Negative Feedback

Facebook’s Definition: “Lifetime: The number of people who have given negative feedback to your post. (Unique Users)”

 

Big Data

Your Laptop Can Now Deal with Big Data

‘The new software, called GraphChi, exploits the capacious hard drives that are becoming ever more common in personal computers. A graph would normally be stored in temporary memory (RAM) for analysis. With GraphChi, the hard drive performs this task instead’

Data Science

 

How Economists get Tripped up by Statistics – how raw numbers lead to more spectacular errors

But here’s the thing: when the economists were shown both the graph and the detailed numbers, the number of economists getting the answer spectacularly wrong — the number giving an answer of less than 10 — soared. Just working with their eyeballs, 3% of economists got it wrong. Working with the numbers as well, that proportion rose to 61%! And when a third group was given the numbers and no chart at all, fully 72% of them — professional economists all — got the answer badly wrong.

 

DBMS

 

NoSQL: Not Only a Fairy Tale – Real life implementation in Ad-recommendation

‘Since we need to query external systems and we have a network roundtrip to our final response to the request, we only have little time processing the bids. We are speaking of ~25ms here to choose from hundreds of possible ads’

 

Idea

 

The rise of the Data Smuggler

“Two things changed my mind about why physically transporting data is interesting. A conversation with Sebastian Thrun (creator of Google Street View) that I had a few years back where he told me that Fedexing data is, and probably always will be, the highest bandwidth way of moving data around. That’s why Google uses Fedex to send hard drives from their Street View vans back to headquarters.”

 

Learning

Bamboo Charts

 

Visualization

 

Can Pie Charts help you save more?

 

etc

State of Data #107

Top Read

‘Nobody ever got fired for using Hadoop on a cluster’ (Microsoft Research)


“We analyzed 174,000 jobs submitted to a production analytics cluster in Microsoft in a single month in 2011 and found that the median job input data set size was less than 14GB… Facebook jobs follow a power-law distribution with small jobs dominating; from their graphs it appears that at least 90% of the jobs have input sizes under 100 GB. We therefore believe that there are many jobs run on these clusters which are smaller than the memory of a single server.”


Analysis

Highly insightful paper from Microsoft – how product measurement (e.g., A/B testing) often deceives

“Bing, Microsoft’s search engine, had a bug in an experiment, which resulted in very poor search results being shown to users. Two key organizational metrics that Bing measures progress by are share and revenue, and both improved significantly: distinct queries per user went up over 10%, and revenue per user went up over 30%!”

Big Data

7 Startups trying to solve your Big Data Problems


Data Science

‘Introduction to Data Science’ (free) Book

 

 

DBMS


SQLMap: Automatic SQL Injection tool


Idea

How the natural attractiveness of the Normal Distribution leads people to build elusive models for random, ‘Black Swan’ events. Or, why we made large-scale ‘financial crises’ unavoidable.

“Now for an abnormal question: to what extent is normality actually a good statistical description of real-world behaviour?  Evidence against has been mounting for well over a century.

In the 1870s, the German statistician Wilhelm Lexis began to develop the first statistical tests for normality.  Strikingly, the only series Lexis could find which closely matched the Gaussian distribution was birth rates. The natural world suddenly began to feel a little less normal.”


Learning

Everything you wanted to know about Machine Learning in under 30 minutes – a talk from Hilary Mason

‘The talk is geared toward engineers with no prior knowledge of machine learning, and it’s designed to lay out the basic vocabulary and way that we think about the world to provide an amusing foundation. This talk is not an in-depth tutorial.’


Visualization

The Blue Economy – Visualizing Fishing, Transport, Energy & Cities


etc

State of Technology #66

At Other Places

· ‘My boss decided to add ‘Person to Blame’ field to every bug report. What to do?’

· Leap – ‘Own the Future. $70’. Minority Report Deployed

· How Apple keeps secrets so well – from the wife of an Apple Engineer

· Ford creates a clever app that will let you log in to apps without keying in anything

· A day in the life of a CTO

Architecture

Single Page App Book – ‘the focus is on discussing patterns, implementation choices and decent practices’

 

Code

A Gentle Introduction to Algorithm Code Complexity Analysis uses Big-O notation and provides some usable Rules of Thumb to help high school students code better. It’s a good quick read for everyone, however.
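As a small illustration of the kind of rule of thumb Big-O analysis gives you (this example is not taken from the linked article, and the function names are made up), here are two ways to check a list for duplicates. The nested-loop version does work proportional to n² in the worst case, while the set-based version makes a single pass and is O(n) on average, assuming the items are hashable.

```python
def has_duplicates_quadratic(items):
    """Compare every pair: about n*(n-1)/2 comparisons, so O(n^2)."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """Remember what was seen in a set: one pass, O(n) on average."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


sample = [3, 1, 4, 1, 5]
print(has_duplicates_quadratic(sample), has_duplicates_linear(sample))
```

Both functions return the same answers; only the amount of work grows differently as the input gets longer.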

 

Design

Is Plastic Clamshell Packaging the worst piece of design ever done?

‘It’s been the cause of thousands of emergency room visits, and there’s even a Wikipedia-approved term to describe the frustration you feel when confronted with an unrelenting piece of plastic between you and your product.’

Essay

How Complex Systems Fail was written by a doctor who investigated Operating Room errors, avoidable patient deaths etc for a living. This short piece was written more than a decade ago but is still just as insightful.

Mobile

 

The Mobile Playbook from Google

 

SaaS

 

Dynatrace announces the very exciting SpeedoftheWeb, a free benchmarking and optimization service

 

Service


Apple’s minimum viable product was a motherboard!

 

Social


If a smart algorithm creates a speech, is it considered ‘Protected Speech’ covered by the First Amendment?

 

Tool


Side view mirror that eliminates Blind Spot

Hack

Rinser Toothbrush is good for people with chronic back pain

 

 

etc

· Song for Tau Day

· 25 Really Clever ideas to make life easier

 

Parting Thought

Final Idea from Peter Thiel’s Stanford Startup Class

State of Data #106

Top Read – America Revealed (through Pizza Delivery)

‘The fantastic PBS miniseries America Revealed, which explores “the hidden patterns and rhythms that make America work,” makes stunning use of data-viz techniques to stimulate the eye-candy part of your brain while teaching you something.’

Analysis

“I analyzed the chords of 1300 popular songs for patterns. This is what I found”

Big Data

‘The Measured Man’

“Have you ever figured how information-rich your stool is?,” Larry asks me with a wide smile, his gray-green eyes intent behind rimless glasses. “There are about 100 billion bacteria per gram. Each bacterium has DNA whose length is typically one to 10 megabases—call it 1 million bytes of information. This means human stool has a data capacity of 100,000 terabytes of information stored per gram. That’s many orders of magnitude more information density than, say, in a chip in your smartphone or your personal computer. So your stool is far more interesting than a computer.” 
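The arithmetic in the quote can be checked in a few lines of Python. The two input figures are the ones Larry gives; the only assumption added here is that a terabyte means the decimal 10^12 bytes.

```python
# Figures quoted above.
BACTERIA_PER_GRAM = 100 * 10**9   # "about 100 billion bacteria per gram"
BYTES_PER_BACTERIUM = 10**6       # "call it 1 million bytes of information"

# Assumption: decimal terabyte (10^12 bytes).
BYTES_PER_TERABYTE = 10**12

bytes_per_gram = BACTERIA_PER_GRAM * BYTES_PER_BACTERIUM
terabytes_per_gram = bytes_per_gram // BYTES_PER_TERABYTE

print(terabytes_per_gram)  # 100000
```

So 10^11 bacteria times 10^6 bytes gives 10^17 bytes per gram, which matches the quoted 100,000 terabytes.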

Data Science

Inside Netflix Recommendations – Part 2: How they now use social data (what your friends are watching) to recommend stuff

 

DBMS

A really, really brilliant post on UTF8 and Character sets. This explains ALL!
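For a quick feel of why character sets matter (a minimal sketch, not taken from the linked post), the Python snippet below encodes one accented character two ways and then decodes the UTF-8 bytes with the wrong character set. The variable names are illustrative only.

```python
text = "\u00e9"  # the single character "e with acute accent" (U+00E9)

utf8_bytes = text.encode("utf-8")      # two bytes: C3 A9
latin1_bytes = text.encode("latin-1")  # one byte: E9

# Decoding UTF-8 bytes as Latin-1 yields two wrong characters (mojibake).
garbled = utf8_bytes.decode("latin-1")

# Decoding with the encoding that was actually used round-trips cleanly.
restored = utf8_bytes.decode("utf-8")

print(utf8_bytes, latin1_bytes, len(garbled), restored == text)
```

The same character is one byte in Latin-1 and two in UTF-8, so bytes are only meaningful once you know which encoding produced them.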

 

 

Idea

Behavio – from the MIT Media Lab – aims to turn Smart Phone data into genius

Learning 

Chart Chooser – do you want a chart for Comparison, for showing a Trend, or for drawing a Relationship in PowerPoint?

 

Visualization

Blame Twitter is a mock visualization, but teaches how ‘tortured data can confess to anything’

 

etc