State of Technology #22

#at_other_places – 

·  2011 Gartner “Hype Cycle” for technology is out –

IN – Group Buying; Mobile Robots; Video Analytics for customer service; Image Recognition; Big Data
OUT – Cloud Platform; Hosted Desktop; E-readers; In-memory Database


· Bootstrap, toolkit from Twitter to quickly build apps and pages

· jQuery overtakes Flash in most popular websites

#architecture – How to “close” candidates – persuade them to join your team once interviewed –

1) Figure out what the candidate’s dream is.
2) Determine if job and candidate are the right fit.
3) Communicate your own passion.

#code – Java Persistence API – A Quick Intro


#design – Steve Jobs’s Best Quotes – compiled by the Wall Street Journal

“Design is a funny word. Some people think design means how it looks. But of course, if you dig deeper, it’s really how it  works. The design of the Mac wasn’t what it looked like, although that was part of it. Primarily, it was how it worked. To design something really well, you have to get it. You have to really grok what it’s all about. It takes a passionate commitment to really thoroughly understand something, chew it up, not just quickly swallow it. Most people don’t take the time to do that.” 

#essay – “Why Software Is Eating the World” – from Marc Andreessen

“First of all, every new company today is being built in the face of massive economic headwinds, making the challenge far greater than it was in the relatively benign ’90s. The good news about building a company during times like this is that the companies that do succeed are going to be extremely strong and resilient. And when the economy finally stabilizes, look out – the best of the new companies will grow even faster.”

#mobile – “8 Best Practices for Deploying a Top Mobile App” – it is (mostly) about the first two weeks!


How to use UTF-8 Throughout your Web Stack

UTF-8 is extremely sane. Well, as sane as an encoding can be that features backwards-compatibility with ASCII. Everything you care about supports UTF-8. Trust me: you want it everywhere.
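A minimal sketch of the ASCII backwards-compatibility mentioned above, using only Python's built-in codecs. The helper name `utf8_profile` is made up for illustration and is not from the linked article.

```python
def utf8_profile(text: str) -> dict:
    """Encode text as UTF-8 and report a few basic facts about the result."""
    encoded = text.encode("utf-8")
    return {
        "chars": len(text),
        "bytes": len(encoded),
        # Decoding the bytes again must give back the original text.
        "round_trips": encoded.decode("utf-8") == text,
        # Pure-ASCII text produces the very same bytes under both codecs.
        "same_as_ascii": text.isascii() and encoded == text.encode("ascii"),
    }


# ASCII-only text: one byte per character, identical to the ASCII encoding.
print(utf8_profile("plain ASCII"))
# Accented letters (written as escapes here) take two bytes each.
print(utf8_profile("na\u00efve caf\u00e9"))
```

The second call shows why byte length and character length should never be confused once non-ASCII text enters the stack.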

#social – Innovation in developer/customer support – Facebook Hooks up with Stack Overflow

#tool – A great tutorial / write-up on ‘How Browsers Work’ (from HTML5Rocks) –


#tweaks n’ hacks – SSH can do THAT? Very useful productivity tips for working with remote servers.


“It’s more fun to be a pirate than to join the navy.” – Steve Jobs

State of Data #63

#analysis – 46-page Internet Marketing Strategy “briefing looking at customer centricity, channel diversification, data, social media and content strategy. This is their usual high grade quality and worth a look”

Disdain Data Diving – “Today’s Big Data heavy-lifting machines and software systems were built back in the day when millions of customers made millions of phone calls and each one had to be captured, stored, and found in a heartbeat. Banking and credit card transactions by the billions had to be put into safekeeping somewhere they could be added up, averaged, and recalled if need be.”

#architecture – MongoDB loves BSON (Binary JSON) for Data Exchange –

“Fast scan-ability. For very large JSON documents, scanning can be slow. To skip a nested document or array we have to scan through the intervening field completely. In addition as we go we must count nestings of braces, brackets, and quotation marks. In BSON, the size of these elements is at the beginning of the field’s value, which makes skipping an element easy.”
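To make the "size at the beginning" idea concrete, here is a small sketch in Python. It is a toy length-prefixed layout invented for this note (a 4-byte little-endian length followed by the payload), not the actual BSON wire format, and the function names are hypothetical.

```python
import struct


def pack_fields(values):
    """Toy layout: each value is stored as <4-byte length><payload bytes>."""
    return b"".join(struct.pack("<i", len(v)) + v for v in values)


def read_nth(buf, n):
    """Return the n-th value, hopping over earlier ones via their length prefix."""
    offset = 0
    for _ in range(n):
        (size,) = struct.unpack_from("<i", buf, offset)
        offset += 4 + size  # skip the payload without inspecting a single byte of it
    (size,) = struct.unpack_from("<i", buf, offset)
    return buf[offset + 4 : offset + 4 + size]


# The first value is full of braces and quotes, yet skipping it needs no parsing.
buf = pack_fields([b'{"a": [1, 2, {"b": "}"}]}', b"target"])
print(read_nth(buf, 1))
```

With plain JSON text, finding where the first value ends would mean tracking every brace, bracket and quotation mark along the way.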

#big_pig_data – Angry Birds is played 1.4B minutes a week. Now, they have tied up with a predictive analytics solution provider to help forecast pig smashing abilities.

#Data_Science –   Multiple packages in R to read online datasets 

 – A phenomenal paper from NoCOUG on ‘NFS Tuning for Oracle’ (PDF) by Kyle Hailey. 

 #idea – Facebook engineer suggests reducing disk RPM to reduce data center power cost


| Item | Value |
|---|---|
| Normal Speed | 7200 RPM |
| Reduced Speed | 3600 RPM |
| State Transition (triggered by an OS command) | 15 seconds |
| Normal Idle Power | 7W |
| Reduced Speed Idle Power | 3W |
| Normal Bandwidth | ~100 MB/s |
| Reduced Speed Bandwidth | >10 MB/s |
| Normal Latency | ~10 ms |
| Reduced Speed Latency | <100 ms |



 #learning – What every Data Programmer Needs to know about Disks (PPT; from OSCON 2011) – very highly recommended especially for ‘Why EC2 I/O is Slow and Unpredictable’ –

Newer Intel chips have the northbridge controller on-die. Southbridge bandwidth is usually <= 10GB/sec, and you are sharing this with other customers’ network and disk I/O. That, and you may be sharing drive spindles.



#visualization – Stanford’s ‘Republic of Letters’ visualization – “on database of thousands of letters exchanged between prominent intellectuals in the 17th and 18th centuries” – is built with HTML5. It has connections, volume and flow views of over 55,000 letters exchanged among 6,400 correspondents.



State of Technology #21

 #at_other_places –

o Dear Photograph;
o Proust;
o (free ‘webex’ with no registration);
o FreeRice (You learn; hungry people get to eat)

#architecture – How to retire a great Interview problem – “word break” problem described as –

Given an input string and a dictionary of words, segment the input string into a space-separated sequence of dictionary words if possible. For example, if the input string is “applepie” and dictionary contains a standard set of English words, then we would return the string “apple pie” as output.
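One common way to solve the problem as stated is dynamic programming over prefixes. The sketch below is this editor's illustration, not necessarily the solution discussed in the linked post; the tiny dictionary is an assumption standing in for "a standard set of English words".

```python
from typing import List, Optional, Set


def word_break(s: str, dictionary: Set[str]) -> Optional[str]:
    """Return one space-separated segmentation of s, or None if there is none."""
    # best[i] holds a list of words that spells s[:i], or None if s[:i] cannot be segmented.
    best: List[Optional[List[str]]] = [None] * (len(s) + 1)
    best[0] = []  # the empty prefix needs no words
    for end in range(1, len(s) + 1):
        for start in range(end):
            prefix = best[start]
            if prefix is not None and s[start:end] in dictionary:
                best[end] = prefix + [s[start:end]]
                break  # one valid segmentation per prefix is enough
    result = best[len(s)]
    return " ".join(result) if result is not None else None


print(word_break("applepie", {"app", "apple", "pie"}))
```

Remembering the answer for every prefix keeps the work at O(n^2) dictionary lookups instead of the exponential blow-up of naive backtracking.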

#code – Learning JavaScript on the fly with Codecademy is a really, really effective and smart way to learn. No registration required.

#design – The man who designed ‘Like’

Some of Facebook’s look was inspired by the videogame look of the 1980s. “Back then, the aesthetic had a very limited color palette relative to videogames today. Everything is a bit blocky, without smooth surfaces,” he said. Yet, “there is a level of artistry in videogames that is unparalleled.”

#essay – What 8 things Susan Wojcicki learned about innovation as employee #16 – among other principles – “Never_fail_to_fail” and “Spark_with_imagination, fuel_with_data” –

“.. technology for driverless cars to reduce the number of lives lost to roadside accidents each year. These cars, still in development, have logged 140,000 hands-free miles driving down San Francisco’s famously twisty Lombard Street, across the Golden Gate Bridge and up the Pacific Coast Highway without a single accident.”

P.S. Not anymore without accident! 

#mobile – iPhone component cost is $178 – Samsung alone gets about $45 of it; Apple’s slice is $378 

#saas – Sign of things to come in SaaS delivery – Firefox removes version number

#social – Drug companies lose special protection…on Facebook 

#tool – Step-by-step guide to find JavaScript memory leaks; including actual memory leak problem analysis from Facebook. 

#tweaks n’ hacks – Data Sandals probably won’t rock the fashion scene any time soon. But…



#parting_thought – “When you’re young, you look at television and think, There’s a conspiracy. The networks have conspired to dumb us down. But when you get a little older, you realize that’s not true. The networks are in business to give people exactly what they want. That’s a far more depressing thought. Conspiracy is optimistic! You can shoot the bastards! We can have a revolution! But the networks are really in business to give people what they want. It’s the truth.”

— Steve Jobs

State of Data #62

#analysis – Hotmail product usage data analysis and how it influences the design –

“three types based on their behavior – Filers, Pilers, and Deleters..
Deleters generally delete email after it arrives. Deleters receive an average of 211 email messages each week and end up deleting almost 80% of them.. The mantra for these people is, “My kitchen has to be clean before I start cooking.”

Filers put nearly half of their email (44%) into folders immediately after it arrives.

Pilers receive the least amount of email each week (174 messages). But that means they still receive an average of 9,048 email messages per year. Because most of those messages (57%) never leave the Piler’s inbox, their email starts to pile up.”


Google has started certification on Analytics with detailed “Analytics IQ Lessons” culminating in an exam

#big_data – The whole controversy around KissMetrics’ data collection practices, and their official response to the allegations

#conference – ACM Data Mining Camp, October 2011 – “local, cheap, and high-quality learning opportunity”

#Data_Science –  Verifying Benford’s Law on Tweets  – it works! 
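For readers who want to try a check like this themselves, here is a minimal Python sketch. Benford's Law predicts that a leading digit d shows up with probability log10(1 + 1/d). The powers of two below are stand-in data chosen by this editor (they are known to follow the law closely); no tweet data from the linked post is reproduced, and the function names are made up.

```python
import math
from collections import Counter


def benford_expected(digit: int) -> float:
    """Share of values that Benford's Law predicts will start with this digit (1-9)."""
    return math.log10(1 + 1 / digit)


def leading_digit_shares(values):
    """Observed share of each leading digit among positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


sample = [2 ** n for n in range(100)]  # stand-in data set, not tweet counts
observed = leading_digit_shares(sample)
for d in range(1, 10):
    print(d, f"observed={observed[d]:.3f}", f"expected={benford_expected(d):.3f}")
```

Swapping `sample` for follower counts or retweet counts would give the same kind of side-by-side comparison.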

#DBMS – Most Big Data engineers mention ‘performance’ as the #1 priority. ‘3-minute test: What do you know about SQL Performance’ lets you figure out your strengths; choose between MySQL, Oracle, PostgreSQL and SQL Server, and hammer away.


– Are we becoming too analytical? Serious introspection to be self-aware of possible ‘bandwagon effect’ of ‘big data’ and ‘analytics’–

“But the biggest reason I believe these two products have not taken off is their reliance on the belief that simply giving people their data and letting them analyze it is the way to improve behavior (both for health and for the environment)

One of the first things we teach in introductory human-computer interaction (HCI) is that “you are not your user” and “beware designer ego bias.” Google seemed to have fallen into this well-known trap in their design and testing for Google PowerMeter (and perhaps Google Health).”


#learning – Stanford University courses on Data – FREE for Fall, 2011, requires about 10 hrs of work a week per course; class begins on October 10 –


#math/stat – How likely is it for a telephone number (w/o area code) to be prime? About 6%. With area code it may be somewhere around 4%.
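Both figures can be sanity-checked with the prime number theorem, which puts the density of primes near N at roughly 1/ln(N). The Python sketch below assumes that every 7-digit or 10-digit string is an equally likely phone number (real numbering plans exclude some ranges), and the function names are this editor's own.

```python
import math


def prime_density_near(n: int) -> float:
    """Prime number theorem estimate of the share of primes among numbers near n."""
    return 1 / math.log(n)


def count_primes_below(limit: int) -> int:
    """Exact count of primes below limit (limit >= 2), via a sieve of Eratosthenes."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for p in range(2, math.isqrt(limit - 1) + 1):
        if sieve[p]:
            # Cross out every multiple of p starting at p*p.
            sieve[p * p :: p] = bytearray(len(range(p * p, limit, p)))
    return sum(sieve)


print(f"7 digits (no area code): {prime_density_near(10**7):.1%}")
print(f"10 digits (with area code): {prime_density_near(10**10):.1%}")
# Exact cross-check on a smaller range that sieves quickly.
print(f"exact share below 10**6: {count_primes_below(10**6) / 10**6:.1%}")
```

The estimate is a density near the top of each range, so the exact share over the whole range comes out slightly higher, in line with "about 6%" and "around 4%".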


#visualization – Dichotomy or Difference? Statistical Graphics vs. Information Visualization – two crisp articles in most recent ‘Statistical Computing and Graphics Newsletter’ (PDF) discuss it from POVs of Computer Science and Statistics. Follow-up from Andrew Gelman is interesting too.






State of Technology #20


 #at_other_places – 

·         Wolfram released CDF (Computable Document Format) based on ever popular Mathematica

·         Browser is the new tablet – Amazon’s Kindle-on-Cloud reads GREAT

·         Not AAA rating, it’s AAPL

#architecture – Is ‘Open Office Layout’ bad for brain and good for bugs?
Very interesting debate forming out there –

Jordon quoted Joel Spolsky when he mentioned that open-office layouts and the similar concept of war rooms are the places where bugs are bred. According to him, in such settings, no-one can concentrate for long due to constant interruptions and distractions.

#code – Absurdity of some software patents –


  1. ‘someone’ patented Linked list
  2. and patented ‘Electronic shipping notifications’ too

#design – The best ever education on password strength

#essay – 
GPS is changing our brains faster than we think –


“There is an idea popular in technophilia, dating back at least to Marshall McLuhan, that some technologies may be considered an “extension” of our own minds or selves. Scott Adams, sounding not unlike the drones who spin corporate techno-jargon in his comic strip Dilbert, has said just such a thing about GPS devices, claiming that they are part of our “exobrain” (and that this means that “technically, you’re already a cyborg”). It seems a rosy picture with a rosy appeal: GPS gives us additional abilities in physical space; therefore it extends our abilities into space; therefore it is an extension of us, or of our minds or brains. More precisely, as Adams puts it, “your regular brain uses your exobrain to outsource part of its memory, and perform other functions.””

#mobile – 
Not disk capacity, not RAM capacity, and certainly not the number of transistors – battery charging time has improved the LEAST (PDF) over the last few decades. An iPhone can only accept ~2.5 watts while charging, while a human generates about 100 watts running on a treadmill – can we hook the iPhone up to ourselves and charge it?


#social – Analytics to replace counseling?! 



#tool – ‘Electronic tattoo’ – otherwise known as ‘Epidermal Electronics’ — could potentially be dangerous for privacy. Right now, it’s good for science. 


#tweaks n’ hacks – Useful ‘how-to’ algorithms to enhance images – how to ‘beautify’ a face; change B&W to Color


  • Why are restaurant websites so awful?

    Restaurant sites are the product of restaurant culture. These nightmarish websites were spawned by restaurateurs who mistakenly believe they can control the online world the same way they lord over a restaurant. “In restaurants, the expertise is in the kitchen and in hospitality in general,” says Eng San Kho, a partner at the New York design firm Love and War, which has created several unusually great restaurant sites (more on those in a bit). “People in restaurants have a sense that they want to create an entertainment experience online—that’s why disco music starts, that’s why Flash slideshows open. They think they can still play the host even here online.”


#parting_thought – “Some very considerable part of the gestural language of public places, that had once belonged to cigarettes, now belonged to phones” – William Gibson in ‘Zero History’ 



State of Data #61

#analysis – What does that “Register” button cost you? It cost one e-tailer $300M/year, as the “fastest way to alienate those customers and scare away that free money is to make its owner establish a relationship with you before s/he can make a purchase”.

Consumer:Creator ratio – 1M:1 (50 years ago) to 100:1 (Etsy era) 

Very detailed tabular comparison of Top 6 “Cloud Computing” services (PDF) – AWS; GAE; Azure;; RackSpace and GoGrid


#big_data – (Greenplum + SAS) vs. ($5K hardware + R Enterprise) – the latter ran logistic regression on 1 billion records in 75 seconds – “just as fast, and at less than 1% of the hardware cost”

#Data_Science –   Machine Learning on Big Data – Lessons Learned from Google Projects. E.g., how do they render the ‘best guess’ in the following search?


#DBMS – Mythbusters: Stored Procedures Edition – agree or disagree, worth a read.



#learning – The bell curve (or normal distribution) is not just a math thing; it is naturally ubiquitous. Watch out for it in door wear patterns (why would the left door’s wear distribution sit above the right door’s? This editor has a theory. Hint: in which hand would most people carry goods when leaving a store?)


#visualization – Ever think what the real color of summer would be? Or, of Thursday? “using simple algorithms on data originating from subjective human perceptions – system created to find out the colour of anything, by querying and aggregating image data from Flickr”


  • Would you choose a different number if asked for a ‘favorite number’ rather than a ‘random number’? Most people intrinsically like prime numbers. Help uncover the world’s most ‘favorite number’.

  • Backup 1: Chaos 0 – Make Data Immortal – Startup claims a DVD form-factor storage that “you can dip it in liquid nitrogen and then boiling water without harming it”
  • Backup 1: Chaos 1 – ‘Hard Disk Crusher’ – a ‘new spin on destruction’. The Economist writes – “A baseball bat might have been more liberating, but the hydraulic crusher’s surgical precision nonetheless holds a certain charm.”

Drive or Fly from SFO to LAX – DBMS or noSQL for your Transactional App?

Clarke’s 1st Law – “When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.”

“The NoSQL movement is a lot like the Ron Paul campaign – it consists of people who are dissatisfied with the status quo, whose dissatisfaction has a lot to do with insufficient liberty and/or excessive expenditure, and who otherwise don’t have a whole lot in common with each other.” –    Curt Monash

“The computer industry is the only industry that is more fashion-driven than fashion. Maybe I’m an idiot, but I have no idea what anyone is talking about. What is it? It’s complete gibberish. It’s insane. When is this idiocy going to stop?” – Larry Ellison

Technologists have it real tough. We cannot bury our heads in the sand when faced with a slew of new (often very meaningful) technologies. Neither can we be Don Quixote and embrace change purely for the sake of change.


If I had saved $1 for every time I was asked “Should my app move on to noSQL?”, the brutally bruised 401(k) would surely look just about OK now. In the end, the way to decide for or against immediate change is simple and mathematical.

Change if (the cost of change) < (the cost of doing nothing).

Based on that, here is some brief guidance that should, hopefully, help you make an informed judgment when choosing the data management strategy for your particular application. I have made things simpler than they really are to underscore the pattern.

A typical 3-tier transactional system spends about 80% of its total processing time within the database.

Within the database, the time is typically divided as follows – 80% of the time within the DB is spent on what we could call “connection” overhead: multiple SQLs fired over JDBC etc.; N+1 type mapping issues; un-optimized OR-patterns etc. The remaining 20% is somewhat evenly distributed across four categories of latency, irrespective of RDBMS vendor. It is easy to get rid of (almost) all of the “connection” overhead by hand-coded optimizations or, in extreme cases, by putting things in stored procs.

Assuming (a) a stored proc runs the SQL; (b) the data is already cached in memory; and (c) the access path has already been determined in previous runs, a typical query returning a few records should really take about 40ns to run. But it takes about 1~2 ms to run because of –

  1. Logging (mainly systemic logging) – 18%
  2. Locking (mostly to ensure “A” of ACID) – 20%
  3. Latching (a low-level locking, to ensure “C”) – 20%
  4. Buffer Management – 35%
  5. Actual Work – 7%

Roughly, if your application is creating 10M invoices a day @ 0.5 sec end-user-latency/invoice, that works out to –


  • 5M sec of total latency/day
  • 4M sec of DB latency/day
  • 3.2M sec of latency/day attributed to “connections” (firing more SQLs than needed; bad mapping; unavoidable mapping; denormalization etc)
  • 744,000 sec of “overhead” to pay for (mainly) ACID; vendor features in form of Locking; Logging; Latching and Buffer Management
  • 56,000 sec of *actual DB processing*
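The breakdown above is plain arithmetic, and it can be re-run for other workloads with a few lines of Python. The percentages are the rough ones quoted in this post (80% in the DB, 80% of that in “connections”, 7% of the remainder as actual work); the function and key names are made up for illustration.

```python
def latency_breakdown(invoices_per_day, latency_per_invoice_sec,
                      db_share=0.80, connection_share=0.80, actual_work_share=0.07):
    """Split a day's total end-user latency using the rough shares from this post."""
    total = invoices_per_day * latency_per_invoice_sec
    db = total * db_share                    # time spent inside the database
    connections = db * connection_share      # extra SQLs, mapping, data transfer
    remainder = db - connections             # what is left once "connections" are removed
    actual_work = remainder * actual_work_share
    overhead = remainder - actual_work       # logging + locking + latching + buffer mgmt
    # Round to whole seconds so tiny floating-point noise does not show up.
    return {
        "total": round(total),
        "db": round(db),
        "connections": round(connections),
        "overhead": round(overhead),
        "actual_work": round(actual_work),
    }


print(latency_breakdown(10_000_000, 0.5))
```

Change the shares to match measurements from your own system before drawing conclusions; the defaults are only the ballpark figures used above.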

Thus, the “trade-off” analysis for handing over to “noSQL” arises if –

  1. The “overhead” to pay is the major component. Say, rather than a typical 15%, the “overhead” is 50%
  2. The reasons for “overhead” no longer apply much (the system can ‘execute and forget’ – no need to lock, log, multi-user access, security, auditing etc)
  3. “Connection” (i.e., JDBC, ODBC, Data Transfer, Amount of Code executing) has been taken care of either via
    1. Hand-coded optimizations
    2. Stored-proc like “centralized” modular processing

It is a lot like deciding whether to fly or drive from SFO to Los Angeles. The pure flying time (“actual processing work”) is about 90 min, but (home to SFO airport) + (LAX to hotel) + security could add 3 hrs. A company executive could charter a flight and get rid of the security check-ins etc. (“lock, latch, log”) – but for general purposes we commoners just bear the “overhead”, hoping the security scanners will make us safer in the end.



State of Technology #19

 #at_other_places – 

  • New CDN service from Google
  • Java 7 ships. Caveat – a very serious bug has already been discovered.
  • If Tim Berners-Lee had patented the World Wide Web, we’d be able to use it freely starting today. Web turns 20.
  • On Monday, Adobe announced Edge – for developing interactive content and animations using the open Web standard HTML5. Interesting competition with Flash

#architecture – Architecture of Peecho – “If You Are Slow, you Can’t Grow”

#code – Is this the most expensive one-byte mistake?

“the C/Unix/Posix use of NUL-terminated text strings. The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end?”
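The trade-off in that quote can be mimicked over raw bytes in Python. This is only an illustration by this editor: the 2-byte little-endian length prefix is an arbitrary choice, and neither function is how any particular C library does it.

```python
def nul_terminated_len(buf: bytes, start: int = 0) -> int:
    """Length of a NUL-terminated string: walk byte by byte until 0x00 is found."""
    n = 0
    # Raises IndexError if the terminator is missing; C would read past the buffer.
    while buf[start + n] != 0:
        n += 1
    return n


def length_prefixed_len(buf: bytes, start: int = 0) -> int:
    """Length of a string stored as <2-byte little-endian length><payload>: one read."""
    return int.from_bytes(buf[start : start + 2], "little")


print(nul_terminated_len(b"hello\x00junk"))    # has to touch every byte of "hello"
print(length_prefixed_len(b"\x05\x00hello"))   # reads just the two prefix bytes
print(nul_terminated_len(b"a\x00b\x00"))       # an embedded NUL cuts the string short
```

The magic-character form saves a byte, but every length query is a scan and the payload can never contain the terminator itself.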

#design – Why users fill out forms faster with Unified Text Fields? Insight – do not let their eyes move across, keep a smooth flow

#essay – How Bob Dylan influenced Mac design – from Andy Hertzfeld of Google+

#mobile – 
IEEE announced the new 802.22 wireless standard for Wireless Regional Area Networks (WRAN), which can cover up to about 12K square miles

#saas – Why Firefox handily beats Chrome if you have multiple tabs open

#social – Mobile Payments are in a mess 


#tool – Automatically mute background tabs – should practically be default behavior for every browser.


#tweaks n’ hacks – Tom’s Hardware proved that “data is not the plural of anecdote” – very counter-intuitive, as leading-edge stuff often is. We may even decide to choose SSD – if not for speed, then because it “fails less”.



#parting_thought – “When you have exhausted all possibilities, remember this: you haven’t”
– Thomas Edison on effort

State of Data #60

#analysis – Using R and Motion Charts to analyze financial data 


#architecture – Build ‘Just in time, not Just in case’ – Twitter’s ‘Data Architecture 1.0’ did not contain many of the intuitively obvious strategies (sharding, primary key partitioning etc.) – “Big Data in Real-Time at Twitter”. If it were not from Twitter, a tenured technologist might even have scoffed at it – but, hey, they got it done eventually.

#big_data – Facebook, 37Signals, Twitter all did it. But eBay’s swift and successful deployment of 100TB SSD shows that “singularity” has been reached for Big Data.

Past and present ‘Big Data’ strategies are based on the paradigm of ‘disk access is expensive, avoid it if you can’. Businesses will morph, revisit and often even ignore that paradigm as disk access becomes 10x faster in about 18 months at today’s price-reliability ratio.
After replacing 100TB of storage in a year, eBay saw a 50% reduction in standard storage rack space, a 78% drop in power consumption and a five-fold boost in I/O performance. That speed boost now allows eBay to bring a new VM online in five minutes, compared to 45 minutes previously.


#DBMS – Save your B, C, D (Business, Customers, Data) from A (Anon attacks) – an excellent pocket reference on ‘SQL Injection’ – for MySQL, Oracle, MSSQL

#learning – What is the answer to every question in the world?

(a) 42;

(b) It Depends
‘Indexing the WWW – The Journey so Far’ is a must-read for understanding the nuanced trade-offs between supposedly obvious strategies (say, memory-based indexing) and those that are typically not the first choice in coffee-table voting (say, disk-based indexing). At some boundary, every solution stops working as advertised. The trick is to find out how far that boundary lies from present business needs.

#visualization – Who uses more storage? Manufacturing wins hands down (Hat tip – Sharat Israni)



  • Economics of Keywords in Adwords – Looking at Google keywords cost analysis, web looks like a giant engine mainly used to insure (car, health, cord blood), claim and borrow money.
  • Trivial Pursuit of Happiness – How the hastily conceived levity of the ‘Big Mac Index’ captures the economic pulse better than many far more carefully designed indicators.
  • 27,000? How many English words do you know? Does it vary between native and non-native speakers? Climate? Statistically, it is easy to find out if you spend about 6 minutes here.
  • Simplicity Wins – Eventually, longer words get obliterated. Simple ‘X-ray’ killed ‘Roentgenogram’. What do we learn from the word usage histogram after getting access to 4% of books ever printed? Does the mobile spell checker nudge people to use ‘canonical and shorter’ words?
  • Google uses 0.01% of World Electricity thanks to the 900,000+ server fleet