The Amazon Outage…And the Lessons Learned
At 1:49am EDT yesterday, Amazon’s cloud platform (known as AWS) suffered a major and ongoing outage that took down many small and large web services around the world. NPR has a clear, accessible explanation of what happened:
Major websites including Foursquare and Reddit crashed or suffered slowdowns Thursday after technical problems rattled Amazon.com’s widely used Web servers, frustrating millions of people who couldn’t access their favorite sites.
Read the rest of their article here: http://n.pr/fAMEoG. At the time of this writing, thousands of sites are still affected, as Amazon has not yet resolved its issues. Interested clients can visit the Amazon status page directly for a current status report.
While the Amazon outage did not affect any of our hosted websites, nor any of our dedicated or cloud server clients, it did take down our shared mail systems. Because we employ third-party monitoring on all of our systems, our team was alerted to the Amazon issues immediately, even before Amazon itself acknowledged there was a problem. We began working with Amazon right away to find the cause, and quickly discovered that the problem was not isolated to our mail servers, but was affecting wide swaths of the Internet. At that point our team had no choice but to wait on Amazon to provide the expected resolution.
When it became apparent that the problems at Amazon were getting worse rather than better, our team began reviewing our options, and decided to rebuild our mail systems from our last backup. We keep regular nightly backups of our mail systems in a different physical location with Amazon, and thus our backups were unaffected by the Amazon outage. By yesterday afternoon, we had restored all mail services. Because we were forced to rebuild our mail systems from a day-old backup, some clients may have lost a few emails. This is unfortunate, and we do apologize for this inconvenience.
Some clients may wonder if there was anything that we could have done to prevent this Amazon outage from affecting our mail systems. While the exact cause and an after-action report from Amazon are still not available (they are still trying to fix the problem), some facts are known, and on balance, the general consensus among Internet experts is that this Amazon outage was unprecedented and nearly impossible to plan for:
This morning, multiple availability zones failed in the us-east region. AWS broke their promises on the failure scenarios for Availability Zones. It means that AWS have a common single point of failure (assuming it wasn’t a winning-the-lottery-while-being-hit-by-a-meteor-odds coincidence). The sites that are down were correctly designing to the ‘contract’; the problem is that AWS didn’t follow their own specifications.
You can read the rest of this informative article here: http://bit.ly/hPOjkH. However, just because this event was unprecedented and a complete failure on Amazon’s part doesn’t mean that our team didn’t learn valuable lessons. Frequently, it is the largest failures that teach the best lessons.
This epic event showed our team that our basic recovery plans were sound and did allow us to recover relatively quickly from this outage. In addition, we have used this event to make adjustments to our disaster recovery procedures to ensure that the next time such an event occurs we will have significantly less downtime.
- Our team will begin taking more frequent backups of our systems. This will allow us to recover from a more recent backup should such a failure ever occur again.
- We have changed the way we build our mail systems to allow for a quicker recovery in future events. In this case, it took our team 5 hours to rebuild the mail systems; with these changes already implemented, we expect that future re-builds will take less than an hour.
- We have updated our disaster recovery documentation to reflect the lessons learned during this outage to ensure that our team has the latest and best procedures to follow for future events.
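For readers curious how backup frequency limits lost mail, here is a minimal sketch in Python. It is an illustration only, not our actual tooling: the function name `data_loss_window`, the dates, and the hourly schedule are all hypothetical, and it assumes backups complete instantly on a fixed schedule.

```python
from datetime import datetime, timedelta


def data_loss_window(failure_at: datetime,
                     first_backup_at: datetime,
                     interval: timedelta) -> timedelta:
    """Return the time between the last scheduled backup and the failure.

    Anything received inside this window is missing from a restore.
    Assumes (hypothetically) that backups run on a fixed interval and
    finish instantly.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if failure_at < first_backup_at:
        raise ValueError("failure occurred before the first backup")
    # Number of whole intervals elapsed since the first backup.
    completed = (failure_at - first_backup_at) // interval
    last_backup_at = first_backup_at + completed * interval
    return failure_at - last_backup_at


# Hypothetical schedule: backups start at 02:00, failure at 01:30 the next day.
first_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 2, 1, 30)

nightly = data_loss_window(failure, first_backup, timedelta(days=1))
hourly = data_loss_window(failure, first_backup, timedelta(hours=1))
print("nightly backups lose up to:", nightly)
print("hourly backups lose up to:", hourly)
```

In this made-up scenario, a nightly schedule leaves nearly a full day of mail at risk, while an hourly one leaves at most an hour. Total disruption is this window plus the time it takes to rebuild, which is why the changes above address both.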
While our team is very disappointed that our clients had to suffer through another outage, we are pleased that we were able to recover services relatively quickly. As we write this, over 16 hours after we recovered our services, major sites like reddit.com are still not fully recovered. As always, clients can watch our Trust Site for details on current uptime, and subscribe to our Twitter feed for regular updates on all problems.