Gmail outage blamed on server miscalculation

2 Sep 2009

For many people who depend on Gmail, it must have seemed like the earth stood still … for two hours anyway, as the service experienced a very rare outage.

Yesterday, between 12:30 p.m. PDT and about 2:30 p.m. PDT (that’s 7.30pm to 9.30pm GMT), Gmail experienced an outage, and millions of people who depend on the service for running their business, staying in touch with loved ones or backing up data could not access it.

It is understood that the problem occurred when Google took a number of Gmail servers offline for maintenance.

However, minor changes to the routers that direct Gmail traffic to servers backfired, leaving millions of users unable to access the popular service.

Writing on the Gmail blog, Ben Treynor, vice-president of engineering and, ironically, ‘Site Reliability Czar’, explained what happened.

“This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.

“However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response.

“At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system ‘stop sending us traffic, we’re too slow!’. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.

“The Gmail engineering team was alerted to the failures within seconds (we take monitoring very seriously). After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google’s architecture), distributed the traffic across the request routers, and the Gmail web interface came back online,” Treynor said.
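The chain reaction Treynor describes can be illustrated with a toy model. The sketch below is not Google's routing logic: the function name, the capacity figures and the rule that load is split evenly among healthy routers are all assumptions made for illustration. It shows only how a few overloaded routers dropping out can push their share of traffic onto the rest until none are left, and how adding capacity stops the cascade.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of healthy routers after each round.

    Hypothetical model: total_load is split evenly across healthy
    routers, and any router whose share exceeds its capacity stops
    accepting traffic, pushing its share onto the remaining routers.
    """
    healthy = list(capacities)
    rounds = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # every remaining router can carry its share
        healthy = survivors
        rounds.append(len(healthy))
    return rounds


# Illustrative numbers only: ten routers with slightly different capacities.
routers = [90, 90, 100, 100, 100, 100, 110, 110, 120, 120]

print(simulate_cascade(routers, 880))               # load within capacity
print(simulate_cascade(routers, 920))               # slight underestimate cascades
print(simulate_cascade(routers + [100] * 20, 920))  # extra routers brought online
```

In this model a load only a few per cent above what the weakest routers can carry is enough to take every router out within three rounds, while the same load spread over a larger pool causes no failures at all.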

Treynor said Gmail has taken actions to ensure the outage does not occur again. These include increasing router capacity well beyond peak demand and ensuring problems in data centres can be isolated without denying access.

“We’ll be hard at work over the next few weeks implementing these and other Gmail reliability improvements — Gmail remains more than 99.9pc available to all users, and we’re committed to keeping events like today’s notable for their rarity.”
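To put the 99.9pc figure in context, the arithmetic can be sketched as below. Treynor does not say over what period availability is measured, so the yearly and 30-day windows here are assumptions, as are the function names.

```python
HOURS_PER_YEAR = 365 * 24  # assumed measurement window: one non-leap year


def downtime_budget_hours(availability, period_hours=HOURS_PER_YEAR):
    """Hours of downtime allowed in the period at a given availability."""
    return (1 - availability) * period_hours


def availability_after_outage(outage_hours, period_hours=HOURS_PER_YEAR):
    """Availability over the period if the outage is the only downtime."""
    return 1 - outage_hours / period_hours


# 99.9pc over a year leaves a budget of roughly 8.76 hours of downtime.
print(round(downtime_budget_hours(0.999), 2))

# A single two-hour outage, measured over a year and over a 30-day month.
print(round(availability_after_outage(2) * 100, 3))
print(round(availability_after_outage(2, 30 * 24) * 100, 3))
```

On these assumptions, a two-hour outage fits inside a yearly 99.9pc target but would fall short of it if availability were measured over a single month.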

By John Kennedy