Amazon.com apologizes for multi-day cloud computing outage

Amazon.com issued an apology Friday for its multi-day cloud computing outages last week.

The outages, which struck April 21 and ran through Sunday, left many popular social media sites shut down while the Seattle-based company worked to get its Amazon Web Services servers back in order.

Amazon’s apology came at the end of a 5,679-word letter that explained what caused the temporary failure and said affected customers would have a 10-day service credit automatically added to their accounts.

‘Last, but certainly not least, we want to apologize,’ Amazon’s Web Services unit said to conclude its letter.

‘We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.’

Among the companies affected by the outages were widely used websites and Web-based services such as Foursquare, HootSuite, Reddit and Quora.

At the root of the outage was an incorrectly performed network change ‘as part of our normal AWS scaling activities’ at a data center in northern Virginia, Amazon said.

‘The configuration change was to upgrade the capacity of the primary network,’ Amazon said in the letter. ‘During the change, one of the standard steps is to shift traffic off of one of the redundant routers’ to allow the upgrade to take place.

But the ‘traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant’ network.

That move resulted in a downed primary and secondary server network, the letter said.

‘Traffic was purposely shifted away from the primary network and the secondary network couldn’t handle the traffic level it was receiving,’ Amazon said.

Next up is a self-audit of Amazon’s network change process, as well as a planned increase in ‘automation to prevent this mistake from happening in the future. However, we focus on building software and services to survive failures.’

In addition to making technical changes, Amazon said it also will improve the way it communicates with its customers.

‘We would like our communications to be more frequent and contain more information,’ the letter said. ‘We understand that during an outage, customers want to know as many details as possible about what’s going on, how long it will take to fix, and what we are doing so that it doesn’t happen again.’

During the cloud service outages, Amazon said it was focused on fixing the problem as quickly as possible and identifying the cause of the problems. The company said it provided updates online to customers when it had new information to offer.

‘That said, we think we can improve in this area,’ Amazon said.

-- Nathan Olivarez-Giles

Twitter.com/nateog