Thursday, September 3, 2009

The GMail Outage Autopsy

Apparently yesterday's problem with GMail was caused by some hardware being taken offline for routine maintenance. The remaining boxes met a spike in message traffic and weren't up to the job, so some of them shut down. That threw more work onto the hardware that was left, which made more of them shut down (rinse and repeat).
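To see why taking a little hardware offline can snowball like this, here's a toy simulation. It's only a rough sketch and has nothing to do with Google's real architecture: the function name, the even load split, and all the capacity and traffic numbers are my own assumptions, picked purely for illustration.

```python
def simulate_cascade(total_load, capacities, offline_for_maintenance):
    """Return how many machines are still serving once things settle.

    Hypothetical model (an assumption, not Google's design): traffic is
    split evenly across live machines, and any machine whose share
    exceeds its capacity shuts down.
    """
    # Machines taken offline for maintenance never receive traffic.
    live = list(capacities[offline_for_maintenance:])
    while live:
        share = total_load / len(live)
        survivors = [c for c in live if c >= share]
        if len(survivors) == len(live):
            break  # every machine can carry its share; the system is stable
        live = survivors  # overloaded machines drop out, load is re-split
    return len(live)


# Made-up numbers: six machines facing a traffic spike of 600 units.
capacities = [100, 110, 120, 130, 140, 150]
print(simulate_cascade(600, capacities, 0))  # 6: all machines cope
print(simulate_cascade(600, capacities, 1))  # 0: one machine out, the rest fall over in turn
```

In this model, pulling a single machine for maintenance is enough to take the whole pool from "just coping" to "nothing left", because every shutdown raises the share each survivor has to carry.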

In looking at this issue I've learned three lessons:
  1. Obviously Google can't (and won't) adjust its upgrade schedules to avoid everyone's deadlines, so these problems will continue to occur at inconvenient times. All that you can predict is that in future they will do their upgrades after 9pm USA time (which, of course, is the prime working shift for the rest of the world).

  2. If Google's hardware redundancy is so bad that it can't handle a spike in message traffic when they're doing a PLANNED upgrade, then I don't see how they can hope to handle an UNPLANNED emergency.

  3. When you're using Google to search the internet for information on Google Applications, don't type the word 'GAPE' into your search bar and hit the 'I'm Feeling Lucky' button (don't do it! trust me on this one!).

The article concludes:
"The problem was fixed once Google brought more routers online and spread the traffic among them. Google says it is tweaking its architecture so that the problem doesn't happen again."
Sounds like it needs more than a tweak.

1 comment:

Gavin Bollard said...

Every system, regardless of its software, is at the mercy of its networking hardware and the half-starved revolutionaries in third-world countries to whom it is outsourced.

Our own systems today experienced an outage (apparently detecting that I was taking 3 days of annual leave). Again, nothing to do with our software... our outsourced offsite server people dropped the ball (again)... and no, they aren't third world.