State of Data Last Week – #33

#analysis – Daniel Huffman filtered 1.5 million tweets from March and April 2010 and mapped the rate of profanity (12M PDF map) across America

#architecture –
Ron Bodkin, founder of ‘Think Big Analytics’, discusses big data architecture. He raises some interesting patterns for analyzing large data sets that span data centers, where fitting most of the data in RAM is not feasible. Also, a ‘Data Scientist kind of guy’ needs to ‘notice anomaly’ rather than be just an expert on ‘statistical abstract reasoning’. Even better – ‘Small and big data are really a continuum’: small data can use the same practices and tools and benefit tremendously as well.

#big_data – launches live with 13,000 data sets, 100M time series, 600M facts including from UN, World Bank, Eurostat, Gapminder etc

#career – How to write a ‘noSQL CV’ 😉

The latest ACM issue ruminates about “System Administration Soft Skills” – all of it applies to other ‘backend’ technical jobs as well – DBAs etc.

#learning – Very nice SQL to Pig (Hadoop) reference cheat-sheet

#visualization –
Intuit releases “Small Business By The Numbers” visualization (larger version). Does it really take a day and $109 to open a start-up in New Zealand?




State of Data Last Week – #32

#analysis – How data can make you live forever – Scott Adams (@Dilbert) ponders if one could build a “smart analytical engine” that learns and thrives on one’s online ‘persona’ data (email; tweets; blogs; comments; shopping; GPS) and, once the person is dead, can “represent” him/her in online communities and other digital places. One’s 23rd generation could then have a “live chat” with a forefather who died 1,000 years ago!

Come to think of it, this probably makes more sense than – and is easier to do than – Cryogenics.

#architecture – NoSQL at Twitter, video presentation on Twitter’s choice and usage of NoSQL technologies – Cassandra, Hadoop, Redis etc.

#big_data – ‘Rumors of DBMS demise are greatly exaggerated’ (slides, if short on time) – IBM Distinguished Engineer Billy Newport discusses and contrasts the ‘SQL’ and ‘NoSQL’ mindsets.

#DBMS – Database Administrator job requirement, probably written by a real Database Administrator – Do you think the world should be solved through relational algebra? Don’t apply here. Have you edited a database redo/write-ahead log in a hex editor to bootstrap a failed recovery? Now we’re talking.

#learning – A group of MIT researchers build a straw-man for DBaaS – Database-as-a-Service for the cloud. The seminal paper analyzes the cost and impact of multi-tenancy, elastic scale, privacy, partitioning and migration in cloud-based databases.

#visualization – Mint does a simple visualization showing what Credit Card number digits mean.
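
One digit with a well-defined meaning is the last one: on standard payment card numbers it is a Luhn check digit. Below is a minimal sketch of that check, assuming the graphic follows the standard card-number layout. The function name `luhn_valid` is made up for illustration, and the sample is a widely published test number, not a real card.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # commonly used test number
```

Changing any single digit of a valid number makes the check fail, which is why it catches most typing mistakes.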





State of Data Last Week – #31

#analysis – Easiest way to reach a 99.99% availability SLA is to start counting (scheduled) downtime. Google just adopted the practice in Google Apps.
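
For a sense of scale, here is a rough sketch of the downtime budget an availability target implies, assuming a 365-day year. The function name `downtime_budget_minutes` is made up for illustration.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 365) -> float:
    """Minutes of downtime allowed over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        budget = downtime_budget_minutes(pct)
        print(f"{pct}% -> {budget:.1f} minutes per year")
```

At 99.99% the budget works out to roughly 52.6 minutes a year, so whether scheduled maintenance counts against it matters a great deal.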

#architecture –
NoSQL tapes have videos of lectures on NoSQL from the actual practitioners. At this time, there are over 50 databases listed under ‘NoSQL’ in Wikipedia.

#big_data –
Excellent analysis of Megastore – the data platform behind Google App Engine – it replicates data to 3 data centers and is highly optimized for reads, with no more than 1 write/entity/sec advised. This was interesting – “more than 99.9% of your writes are available for queries within a few seconds” – i.e., at least some operations will not be immediately visible (inconsistent updates) due to the CAP theorem. Some (financial) applications may not be suited for this.

Welcome Mumbai – ‘freeware Windows application targeted at Oracle DBAs’. It has popular “snapper” scripts integrated.

#learning – When is it worth considering a ‘column store’ database rather than a traditional ‘row store’? Stonebraker presented 6 criteria – 3 on I/O and 3 on CPU – that make a whole lot of sense.
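
To make the I/O side concrete, here is a toy sketch (not Stonebraker's criteria themselves) of why a single-column aggregate touches fewer bytes in a column layout. The table, its three fields and the function names are all hypothetical.

```python
import struct

# Hypothetical table: 1,000 rows of (id, price, quantity), 8 bytes per value.
rows = [(i, i * 1.5, i % 7) for i in range(1000)]
ROW_FORMAT = "<qdq"  # int64, float64, int64, no padding

# Row store: all values of one row are stored together.
row_store = b"".join(struct.pack(ROW_FORMAT, *row) for row in rows)

# Column store: all values of one column are stored together.
price_column = struct.pack(f"<{len(rows)}d", *(row[1] for row in rows))


def sum_price_row_store(buf: bytes):
    """Sum the price field; every full row has to be read to reach it."""
    row_size = struct.calcsize(ROW_FORMAT)
    total = 0.0
    for offset in range(0, len(buf), row_size):
        _, price, _ = struct.unpack_from(ROW_FORMAT, buf, offset)
        total += price
    return total, len(buf)  # (sum, bytes scanned)


def sum_price_column_store(buf: bytes):
    """Sum the price field; only the price column is read."""
    count = len(buf) // 8
    prices = struct.unpack(f"<{count}d", buf)
    return sum(prices), len(buf)  # (sum, bytes scanned)


if __name__ == "__main__":
    print(sum_price_row_store(row_store))
    print(sum_price_column_store(price_column))
```

Both functions return the same sum, but the row layout scans three times as many bytes here because each row carries two fields the query never uses.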

#visualization – In a series of cartograms, FedEx shows the changing world visually from its data – e.g., Europe becomes almost 3x the size of the USA for ‘High-technology Exports’

  • Spooky, and beware – Is this valid SQL – “select-1from dual;” with no space between ‘select’ and ‘-1from’? Yes. At least ‘Oracle does not need whitespace for tokenizing the SQL’.
  • Happy New Year for Operations too – On Jan 1, Twitter set a record for tweets per second (TPS) – the world, mostly Japan, sent 6,939 TPS wishing friends and family
  • No time left – UPS was right to minimize left turns for its fleet of 88,000 to save fuel. Data shows a ‘20% reduction in travel time’ if intra-city left turns are replaced by right turns followed by U-turns.
  • California is Italy, Texas is Russia, Oregon is Pakistan. Each state visualized as a country’s economy.

State of Data Last Week – #30

#analysis – Social Data Mining to forecast economic crisis (PDF) – was 2008 really the first ‘financial crisis sparked by Big Data’?

#architecture – Machine vs. Human generated Data – the difference is simple: machine generated data scales linearly with computing power

#big_data –
Billion Prices Project by MIT tracks price fluctuations of >5M items sold by 300 retailers in more than 70 countries

#DBMS – Historical Perspective of ORM and Alternatives (caveat – as he mentions, this person ranks pretty high on ‘orm bad’ Google searches ;-)

#learning – Hans Rosling’s ‘Joy of Stats’ – the whole 59 minutes – is now online.

#outage – ‘100% Data Recovery with unfortunate exception’ (!) from Dec 31 Hotmail outage

#visualization – Interactive Map of Census data block by block ‘including indicators such as ethnic groups, income, housing, families and education’.

Periodic table of HTML5 Elements