Amazon’s cloud outage – how it became stuck in a loop

29 Apr 2011

Amazon, with its aggressive rollout of services, is the poster child for the cloud computing revolution. However, last week's highly publicised crash threatened to kill that revolution at its very dawn.

Amazon today published a summary explaining exactly what happened to its Elastic Compute Cloud (EC2), an outage that affected the services of players like Quora and Foursquare, and how it was fixed – effectively, how the cloud became stuck and how it was made unstuck.

It explained that last week's outage involved a subset of Amazon Elastic Block Store (EBS) volumes, which are tied to specific regions – in this case, the US east region.

When a node in the EBS cluster loses connectivity to the node it replicates to, for whatever reason, it immediately searches for a new node with free space onto which it can copy its data, in a process known as re-mirroring.

There is also a set of control plane services that accepts users' requests and propagates them to the relevant EBS cluster.
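To make the moving parts concrete, the sketch below models the two roles described above: storage nodes that re-mirror to a peer with free space, and a control plane that forwards user requests to the relevant cluster. It is a minimal Python illustration; the class and method names are invented for this example and are not Amazon's implementation.

```python
# Illustrative sketch only, not Amazon's code: a toy model of an EBS-style
# cluster in which a node that loses its replica looks for a peer with free
# space (re-mirroring), plus a control plane that routes user requests to
# the relevant cluster.
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    name: str
    capacity_gb: int
    used_gb: int = 0

    def free_gb(self) -> int:
        return self.capacity_gb - self.used_gb


@dataclass
class Cluster:
    region: str
    nodes: list = field(default_factory=list)

    def find_remirror_target(self, needed_gb, exclude=None):
        """Return a peer with enough free space, or None when the
        cluster's spare capacity is exhausted."""
        for node in self.nodes:
            if node is not exclude and node.free_gb() >= needed_gb:
                return node
        return None


class ControlPlane:
    """Accepts users' requests and propagates them to the relevant cluster."""

    def __init__(self, clusters):
        self.clusters = clusters  # e.g. {"us-east": Cluster("us-east", [...])}

    def create_volume(self, region, size_gb):
        cluster = self.clusters[region]
        target = cluster.find_remirror_target(size_gb)
        if target is None:
            return False  # no node has enough free space
        target.used_gb += size_gb
        return True
```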

The sequence of events began at 1am on 21 April, when a network change was made as part of Amazon Web Services' normal activities in the US east region.

However, during the change, a traffic shift was executed incorrectly and all the traffic was routed onto the lower-capacity redundant EBS network, causing a large number of nodes to lose their connections.

Because a large number of volumes were affected, free capacity on the EBS cluster was exhausted and many of the nodes became stuck in a loop searching for free space that wasn’t there.
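Reusing the illustrative Cluster sketch above, that failure pattern can be sketched like this: with every byte of spare capacity gone, the search never succeeds, and thousands of nodes retrying at once generate what Amazon later calls a 're-mirroring storm'. This is an assumption-laden illustration, not Amazon's code.

```python
import time


def remirror_forever(node, cluster, needed_gb):
    """Failure pattern only: with no spare capacity anywhere in the
    cluster, this search never succeeds, so the node loops forever.
    Thousands of nodes doing this at once is a re-mirroring storm."""
    while True:
        target = cluster.find_remirror_target(needed_gb, exclude=node)
        if target is not None:
            return target   # normally found almost immediately
        time.sleep(1)       # brief pause, then try again, indefinitely
```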

Within hours, latencies climbed and customers began to experience elevated error rates.

By 11.30am, engineers had found a way to prevent the nodes' fruitless search for servers, and by noon the stuck volumes had become 'unstuck'.
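Amazon's summary does not spell out the exact mechanism, but the general idea – a switch operators can flip so that stuck nodes stop retrying a search that cannot succeed – can be sketched as follows. This is purely illustrative, again reusing the earlier sketch's cluster object, and is not Amazon's actual fix.

```python
import threading

# Purely illustrative: a fleet-wide switch that operators can clear so
# stuck nodes stop retrying a search that cannot succeed, then set again
# once spare capacity has been added. Not Amazon's actual mechanism.
remirroring_enabled = threading.Event()
remirroring_enabled.set()  # normal operation


def try_remirror(node, cluster, needed_gb):
    if not remirroring_enabled.is_set():
        return None  # searches paused; the node stays degraded but idle
    return cluster.find_remirror_target(needed_gb, exclude=node)


# During the incident: stop the storm, add capacity, then re-enable.
remirroring_enabled.clear()
```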

However, this wasn’t the end of the issue, as much of the affected data was still offline.

By the next day, the restored volumes were fully replicated. Unfortunately, 0.07pc of the volumes in the US east region could not be restored.

The trigger for the outage at Amazon Web Services

“The trigger for this event was a network configuration change,” Amazon explained. “We will audit our change process and increase the automation to prevent this mistake from happening in the future. However, we focus on building software and services to survive failures. Much of the work that will come out of this event will be to further protect the EBS service in the face of a similar failure in the future.

“We will be making a number of changes to prevent a cluster from getting into a re-mirroring storm in the future. With additional excess capacity, the degraded EBS cluster would have more quickly absorbed the large number of re-mirroring requests and avoided the re-mirroring storm.

“We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large-scale failures,” the company said.
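The capacity point in the quote reduces to simple headroom arithmetic: keep enough spare space in each cluster to absorb a worst-case wave of re-mirroring requests, and alarm before that margin is gone. Below is a hypothetical check, with the figures and safety margin invented purely for illustration.

```python
def needs_more_capacity(total_gb, used_gb, worst_case_remirror_gb, margin=1.2):
    """Return True when spare capacity cannot absorb a large-scale
    re-mirroring event plus a safety margin. All figures illustrative."""
    spare_gb = total_gb - used_gb
    return spare_gb < worst_case_remirror_gb * margin


# Example: a 100 TB cluster, 70 TB in use, worst case of 40 TB re-mirroring.
print(needs_more_capacity(100_000, 70_000, 40_000))  # True: raise the alarm
```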

As well as learning an important lesson in how to better manage what should be routine situations, Amazon said it learned a vital lesson in customer communications and is developing tools to allow its customers to see the health of their data via its APIs.
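Those tools post-date this article, but AWS's SDKs now expose this kind of visibility through the DescribeVolumeStatus API. Here is a rough illustration using the modern boto3 library (which did not exist in 2011); the volume ID is a placeholder.

```python
# Rough illustration of checking EBS volume health from the public API.
# Requires AWS credentials; boto3 and DescribeVolumeStatus post-date this
# article, and the volume ID below is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.describe_volume_status(VolumeIds=["vol-0123456789abcdef0"])

for status in response["VolumeStatuses"]:
    # Status is 'ok', 'impaired' or 'insufficient-data'
    print(status["VolumeId"], status["VolumeStatus"]["Status"])
```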

“Last, but certainly not least, we want to apologise. We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services,” the Amazon Web Services team said.

John Kennedy is a journalist who served as editor of Silicon Republic for 17 years

editorial@siliconrepublic.com